6.46 [1 month] <§§5.4, 6.3-6.8> If you have access to a simulation system such
as Verilog or ViewLogic, first design the single-cycle datapath and control from 
Chapter 5. Then evolve this design into a pipelined organization, as we did in this 
chapter. Be sure to run MIPS programs at each step to ensure that your refined 
design continues to operate correctly. 

6.47 [10] <§6.9> The following code has been unrolled once but not yet sched- 
uled. Assume the loop index is a multiple of two (i.e., $10 is a multiple of eight): 

Loop: lw   $2, 0($10)
      sub  $4, $2, $3
      sw   $4, 0($10)
      lw   $5, 4($10)
      sub  $6, $5, $3
      sw   $6, 4($10)
      addi $10, $10, 8
      bne  $10, $30, Loop

Schedule this code for fast execution on the standard MIPS pipeline (assume that
it supports the addi instruction). Assume initially $10 is 0 and $30 is 400 and that
branches are resolved in the MEM stage. How does the scheduled code compare 
against the original unscheduled code? 

6.48 [20] <§6.9> This exercise is similar to Exercise 6.47, except this time the 
code should be unrolled twice (creating three copies of the code). However, it is 
not known that the loop index is a multiple of three, and thus you will need to 
invent a means of ensuring that the code still executes properly. (Hint: Consider 
adding some code to the beginning or end of the loop that takes care of the cases 
not handled by the loop.) 

6.49 [20] <§6.9> Using the code in Exercise 6.47, unroll the code four times and 
schedule it for the static multiple-issue version of the MIPS processor described on 
pages 436-439. You may assume that the loop executes for a multiple of four times. 

6.50 [10] <§§6.1-6.9> As technology leads to smaller feature sizes, the wires 
become relatively slower (as compared to the logic). As logic becomes faster with 
the shrinking feature size and clock rates increase, wire delays consume more clock 
cycles. That is why the Pentium 4 has several pipeline stages dedicated to transfer- 
ring data along wires from one part of the pipeline to another. What are the draw- 
backs to having to add pipe stages for wire delays? 

6.51 [30] <§6.10> New processors are introduced more quickly than new ver- 
sions of textbooks. To keep your textbook current, investigate some of the latest 
developments in this area and write a one-page elaboration to insert at the end of 
Section 6.10. Use the World-Wide Web to explore the characteristics of the latest
processors from Intel or AMD as a starting point. 
Answers to Check Yourself

§6.1, page 384: 1. Stall on the LW result. 2. Bypass the ADD result. 3. No stall or
bypass required.

§6.2, page 399: Statements 2 and 5 are correct; the rest are incorrect.

§6.6, page 426: 1. Predict not taken. 2. Predict taken. 3. Dynamic prediction.

§6.7, page 6.7-3: Statements 1 and 3 are both true.

§6.7, page 6.7-7: Only statement #3 is completely accurate.

§6.8, page 432: Only #4 is totally accurate. #2 is partially accurate.

§6.9, page 447: Speculation: both; reorder buffer: hardware; register renaming:
both; out-of-order execution: hardware; predication: software; branch prediction:
both; VLIW: software; superscalar: hardware; EPIC: both, since there is substantial
hardware support; multiple issue: both; dynamic scheduling: hardware.

§6.10, page 450: All the statements are false.



Computers in the Real World

Mass Communication without Gatekeepers



Problem to solve: Offer society sources of 
news and opinion beyond those found in the 
traditional mass media. 

Solution: Use the Internet and World Wide 
Web to select and publish nontraditional and 
nonlocal news sources. 

The Internet holds the promise of allowing 
citizens to communicate without the informa- 
tion first being interpreted by traditional mass 
media like television, newspapers, and maga- 
zines. To see what the future might be, we 
could look at countries that have widespread, 
high-speed Internet access. 

One place is South Korea. In 2002, 68% of 
South Korean households had broadband access, 
compared to 15% in the United States and 8% in 
Western Europe. (Broadband is generally digital 
subscriber loop or cable speeds, about 300 to 
1000 Kbps.) The main reason for the greater pen- 
etration is that 70% of households are in large cit- 
ies and almost half are found in apartments. 
Hence, the Korean telecommunications industry 
could afford to quickly offer broadband to 90% 
of the households. 

What was the impact of widespread high-speed access on Korean society?
Internet news sites became extremely popular. One example is OhMyNews,
which publishes articles from anyone after first checking that the facts in
the article are correct.

Many believe that Internet news services influenced the outcome of the 2002
Korean presidential election. First, they encouraged more young people to
vote. Second, the winning candidate advocated politics that were closer to
those popular on the Internet news services. Together they overcame the
disadvantage that most major media organizations endorsed his opponent.

Google News is another example of nontra- 
ditional access to news that goes beyond the 
mass media of one country. It searches inter- 
national news services for topics, and then 
summarizes and displays them by popularity. 
Rather than leaving the decision of what arti- 
cles should be on the front page to local news- 
paper editors, the worldwide media decides. 
In addition, by providing links to stories from many countries, it gives the
reader an international perspective rather than a local one. It is also updated
many times a day, unlike a daily newspaper. The figure below compares the
New York Times front page to the Google News Web site on the same day.

The widespread impact of these technolo- 
gies reminds us that computer engineers have 
responsibilities to their communities. We 
must be aware of societal values concerning 
privacy, security, free speech, and so on to 
ensure that new technological innovations 
enhance those values rather than inadvertently 
compromising them. 

To learn more, see these references:

■ "Seriously wired," The Economist, April 17, 2003
■ OhMyNews, www.ohmynews.com
■ Google News, www.news.google.com



New York Times Front Page

■ Judge Rules Out a Death Penalty for 9/11 Suspect. Rebuke for Justice Dept.
■ Poll Shows Drop In Confidence on Bush Skill In Handling Crises. Country on Wrong Track, Says Solid Majority
■ Revised Admission for High Schools. City Says Students Will Get First Preference
■ No Illicit Arms Found In Iraq, U.S. Inspector Tells Congress
■ U.S. Practices How to Down Hijacked Jets
■ Coetzee, Writer of Apartheid as Bleak Mirror, Wins Nobel
■ Sexual Accusations Lead to an Apology by Schwarzenegger
■ Interim Chief Accepts Stock Exchange Shift
■ Yankees Even with Twins
■ Agency Warns of Fake Drugs
■ Limbaugh Fallback Position

Google News

Top Stories

■ More than 1000 rally behind Schwarzenegger (AP - 5 minutes ago)
  Maria Shriver defends husband (CNN)
  Can accusations hurt Arnold's campaign? (KESG)
  and 1252 related
■ Bush: Hussein 'A Danger to the World' (ABC news - 5 hours ago)
  Bush Stands By Decision (Voice of America)
  Hunt for weapons yields no evidence (The Canberra Times)
  and 598 related

World Stories

■ Defiant UN chief announces rival blueprint for Iraq (The Times (UK) - 2 hours ago)
  France, Russia Assail US Draft on Iraq (Reuters)
  and 782 related

New York Times versus Google News on October 3, 2003 at 6 PM PT. The newspaper front page headlines must balance
big stories with national news, local news, and sports. Google News has many stories per headline from around the world, with links the reader
can follow. Google stories vary by time of day and hence are more recent.




Chapter 7

Large and Fast: Exploiting Memory Hierarchy

Ideally one would desire an indefinitely large memory
capacity such that any particular . . . word would be
immediately available. . . . We are . . . forced to
recognize the possibility of constructing a hierarchy of
memories, each of which has greater capacity than the
preceding but which is less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann
Preliminary Discussion of the Logical Design of an Electronic Computing Instrument, 1946



7.1 Introduction

temporal locality The principle stating that if a data location
is referenced then it will tend to be referenced again soon.

spatial locality The locality principle stating that if a data
location is referenced, data locations with nearby addresses will
tend to be referenced soon.



From the earliest days of computing, programmers have wanted unlimited 
amounts of fast memory. The topics we will look at in this chapter aid program- 
mers by creating the illusion of unlimited fast memory. Before we look at how the 
illusion is actually created, let's consider a simple analogy that illustrates the key 
principles and mechanisms that we use. 

Suppose you were a student writing a term paper on important historical 
developments in computer hardware. You are sitting at a desk in a library with a 
collection of books that you have pulled from the shelves and are examining. You 
find that several of the important computers that you need to write about are 
described in the books you have, but there is nothing about the EDSAC. There- 
fore, you go back to the shelves and look for an additional book. You find a book 
on early British computers that covers EDSAC. Once you have a good selection of 
books on the desk in front of you, there is a good probability that many of the top- 
ics you need can be found in them, and you may spend most of your time just 
using the books on the desk without going back to the shelves. Having several 
books on the desk in front of you saves time compared to having only one book 
there and constantly having to go back to the shelves to return it and take out 
another. 

The same principle allows us to create the illusion of a large memory that we 
can access as fast as a very small memory. Just as you did not need to access all the 
books in the library at once with equal probability, a program does not access all 
of its code or data at once with equal probability. Otherwise, it would be impossi- 
ble to make most memory accesses fast and still have large memory in computers, 
just as it would be impossible for you to fit all the library books on your desk and 
still find what you wanted quickly. 

This principle of locality underlies both the way in which you did your work in 
the library and the way that programs operate. The principle of locality states that 
programs access a relatively small portion of their address space at any instant of 
time, just as you accessed a very small portion of the library's collection. There are 
two different types of locality: 

■ Temporal locality (locality in time): If an item is referenced, it will tend to 
be referenced again soon. If you recently brought a book to your desk to 
look at, you will probably need to look at it again soon. 

■ Spatial locality (locality in space): If an item is referenced, items whose
addresses are close by will tend to be referenced soon. For example, when
you brought out the book on early English computers to find out about
EDSAC, you also noticed that there was another book shelved next to it
about early mechanical computers, so you brought back that book too
and, later on, found something useful in it. Books on the same topic
are shelved together in the library to increase spatial locality. We'll see how
spatial locality is used in memory hierarchies a little later in this chapter.

Just as accesses to books on the desk naturally exhibit locality, locality in pro- 
grams arises from simple and natural program structures. For example, most pro- 
grams contain loops, so instructions and data are likely to be accessed repeatedly, 
showing high amounts of temporal locality. Since instructions are normally 
accessed sequentially, programs show high spatial locality. Accesses to data also 
exhibit a natural spatial locality. For example, accesses to elements of an array or a 
record will naturally have high degrees of spatial locality. 
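
As a rough illustration of these two kinds of locality, the sketch below builds the data-address trace of a loop that reads an array twice and counts how often an address repeats (temporal locality) or immediately follows its neighbor (spatial locality). The function names, the 4-byte word size, and the base address 1000 are assumptions made for this illustration only.

```python
def array_sum_trace(base, length, passes, word_bytes=4):
    """Byte addresses touched when a loop reads an array `passes` times."""
    return [base + i * word_bytes for _ in range(passes) for i in range(length)]


def locality_stats(trace, word_bytes=4):
    """Count repeated addresses and strictly sequential accesses in a trace."""
    seen = set()
    repeats = 0      # address was referenced earlier (temporal locality)
    sequential = 0   # address is the word right after the previous one (spatial locality)
    previous = None
    for address in trace:
        if address in seen:
            repeats += 1
        if previous is not None and address == previous + word_bytes:
            sequential += 1
        seen.add(address)
        previous = address
    return repeats, sequential


# Hypothetical trace: an 8-word array at byte address 1000, read twice.
trace = array_sum_trace(base=1000, length=8, passes=2)
repeats, sequential = locality_stats(trace)
print(len(trace), repeats, sequential)
```

In this small trace, every access in the second pass repeats an earlier address, and all but the first access of each pass follow the preceding word.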

We take advantage of the principle of locality by implementing the memory of 
a computer as a memory hierarchy. A memory hierarchy consists of multiple lev- 
els of memory with different speeds and sizes. The faster memories are more 
expensive per bit than the slower memories and thus smaller. 

Today, there are three primary technologies used in building memory hierar- 
chies. Main memory is implemented from DRAM (dynamic random access 
memory), while levels closer to the processor (caches) use SRAM (static random 
access memory). DRAM is less costly per bit than SRAM, although it is substan- 
tially slower. The price difference arises because DRAM uses significantly less 
area per bit of memory, and DRAMs thus have larger capacity for the same 
amount of silicon; the speed difference arises from several factors described in 
Section B.8 of Appendix B. The third technology, used to implement the largest 
and slowest level in the hierarchy, is magnetic disk. The access time and price 
per bit vary widely among these technologies, as the table below shows, using 
typical values for 2004: 



memory hierarchy A structure that uses multiple levels of
memories; as the distance from the CPU increases, the size of
the memories and the access time both increase.



Memory technology    Typical access time         $ per GB in 2004
SRAM                 0.5-5 ns                    $4000-$10,000
DRAM                 50-70 ns                    $100-$200
Magnetic disk        5,000,000-20,000,000 ns     $0.50-$2



Because of these differences in cost and access time, it is advantageous to build 
memory as a hierarchy of levels. Figure 7.1 shows the faster memory is close to the 
processor and the slower, less expensive memory is below it. The goal is to present 
the user with as much memory as is available in the cheapest technology while 
providing access at the speed offered by the fastest memory. 
Speed      Size       Cost ($/bit)   Current technology
Fastest    Smallest   Highest        SRAM
                                     DRAM
Slowest    Biggest    Lowest         Magnetic disk

FIGURE 7.1 The basic structure of a memory hierarchy. By implementing the memory system 
as a hierarchy, the user has the illusion of a memory that is as large as the largest level of the hierarchy, but 
can be accessed as if it were all built from the fastest memory. 



block The minimum unit of information that can be either
present or not present in the two-level hierarchy.

hit rate The fraction of memory accesses found in a cache.



The memory system is organized as a hierarchy: a level closer to the processor 
is generally a subset of any level further away, and all the data is stored at the low- 
est level. By analogy, the books on your desk form a subset of the library you are 
working in, which is in turn a subset of all the libraries on campus. Furthermore, 
as we move away from the processor, the levels take progressively longer to access, 
just as we might encounter in a hierarchy of campus libraries. 

A memory hierarchy can consist of multiple levels, but data is copied between 
only two adjacent levels at a time, so we can focus our attention on just two levels. 
The upper level — the one closer to the processor — is smaller and faster (since it 
uses more expensive technology) than the lower level. Figure 7.2 shows that the 
minimum unit of information that can be either present or not present in the 
two-level hierarchy is called a block or a line; in our library analogy, a block of 
information is one book. 

If the data requested by the processor appears in some block in the upper level,
this is called a hit (analogous to your finding the information in one of the books
on your desk). If the data is not found in the upper level, the request is called a
miss. The lower level in the hierarchy is then accessed to retrieve the block
containing the requested data. (Continuing our analogy, you go from your desk to the
shelves to find the desired book.) The hit rate, or hit ratio, is the fraction of
memory accesses found in the upper level; it is often used as a measure of the
performance of the memory hierarchy. The miss rate (1 − hit rate) is the fraction of
memory accesses not found in the upper level.

FIGURE 7.2 Every pair of levels in the memory hierarchy can be thought of as having an
upper and lower level. Within each level, the unit of information that is present or not is called a block.
Usually we transfer an entire block when we copy something between levels.

Since performance is the major reason for having a memory hierarchy, the time 
to service hits and misses is important. Hit time is the time to access the upper 
level of the memory hierarchy, which includes the time needed to determine 
whether the access is a hit or a miss (that is, the time needed to look through the 
books on the desk). The miss penalty is the time to replace a block in the upper 
level with the corresponding block from the lower level, plus the time to deliver 
this block to the processor (or, the time to get another book from the shelves and 
place it on the desk). Because the upper level is smaller and built using faster 
memory parts, the hit time will be much smaller than the time to access the next 
level in the hierarchy, which is the major component of the miss penalty. (The 
time to examine the books on the desk is much smaller than the time to get up 
and get a new book from the shelves.) 
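
These terms combine into a standard back-of-the-envelope estimate: average access time = hit time + miss rate × miss penalty. The sketch below is only an illustration of that relation. The function name and the sample numbers (a 1 ns hit time, a 60 ns miss penalty, and hit rates of 95% and 50%) are assumptions, not measurements from the text.

```python
def average_access_time(hit_time_ns, miss_penalty_ns, hit_rate):
    """Average time per access for a two-level hierarchy.

    Every access pays the hit time (which includes deciding hit vs. miss);
    the fraction of accesses that miss additionally pays the miss penalty.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return hit_time_ns + miss_rate * miss_penalty_ns


# Hypothetical numbers chosen only for illustration.
fast = average_access_time(hit_time_ns=1.0, miss_penalty_ns=60.0, hit_rate=0.95)
slow = average_access_time(hit_time_ns=1.0, miss_penalty_ns=60.0, hit_rate=0.50)
print(f"{fast:.1f} ns versus {slow:.1f} ns")
```

With these assumed numbers, the high hit rate keeps the average close to the hit time, while the low hit rate lets the miss penalty dominate.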

As we will see in this chapter, the concepts used to build memory systems
affect many other aspects of a computer, including how the operating system
manages memory and I/O, how compilers generate code, and even how
applications use the computer. Of course, because all programs spend much of
their time accessing memory, the memory system is necessarily a major factor
in determining performance. The reliance on memory hierarchies to achieve
performance has meant that programmers, who used to be able to think of
memory as a flat, random access storage device, now need to understand
memory hierarchies to get good performance. We show how important this
understanding is with two examples.

miss rate The fraction of memory accesses not found in a
level of the memory hierarchy.

hit time The time required to access a level of the memory
hierarchy, including the time needed to determine whether
the access is a hit or a miss.

miss penalty The time required to fetch a block into a
level of the memory hierarchy from the lower level, including
the time to access the block, transmit it from one level to the
other, and insert it in the level that experienced the miss.

Since memory systems are so critical to performance, computer designers have 
devoted a lot of attention to these systems and developed sophisticated mecha- 
nisms for improving the performance of the memory system. In this chapter, we 
will see the major conceptual ideas, although many simplifications and abstrac- 
tions have been used to keep the material manageable in length and complexity. 
We could easily have written hundreds of pages on memory systems, as dozens of 
recent doctoral theses have demonstrated. 



Check Yourself

Which of the following statements are generally true?

1. Caches take advantage of temporal locality.

2. On a read, the value returned depends on which blocks are in the cache.

3. Most of the cost of the memory hierarchy is at the highest level.



The BIG Picture



Programs exhibit both temporal locality, the tendency to reuse recently 
accessed data items, and spatial locality, the tendency to reference data 
items that are close to other recently accessed items. Memory hierarchies 
take advantage of temporal locality by keeping more recently accessed 
data items closer to the processor. Memory hierarchies take advantage of 
spatial locality by moving blocks consisting of multiple contiguous words 
in memory to upper levels of the hierarchy. 

Figure 7.3 shows that a memory hierarchy uses smaller and faster
memory technologies close to the processor. Thus, accesses that hit in the 
highest level of the hierarchy can be processed quickly. Accesses that miss 
go to lower levels of the hierarchy, which are larger but slower. If the hit 
rate is high enough, the memory hierarchy has an effective access time 
close to that of the highest (and fastest) level and a size equal to that of 
the lowest (and largest) level. 

In most systems, the memory is a true hierarchy, meaning that data 
cannot be present in level i unless it is also present in level i + 1. 
[Figure 7.3 labels: Levels in the memory hierarchy; Increasing distance from the CPU in access time; Size of the memory at each level]



FIGURE 7.3 This diagram shows the structure of a memory hierarchy: as the distance 
from the processor increases, so does the size. This structure with the appropriate operating 
mechanisms allows the processor to have an access time that is determined primarily by level 1 of the hier- 
archy and yet have a memory as large as level n. Maintaining this illusion is the subject of this chapter. 
Although the local disk is normally the bottom of the hierarchy, some systems use tape or a file server over a 
local area network as the next levels of the hierarchy. 



7.2 The Basics of Caches



In our library example, the desk acted as a cache — a safe place to store things 
(books) that we needed to examine. Cache was the name chosen to represent the 
level of the memory hierarchy between the processor and main memory in the 
first commercial computer to have this extra level. Today, although this remains 
the dominant use of the word cache, the term is also used to refer to any storage 
managed to take advantage of locality of access. Caches first appeared in research 
computers in the early 1960s and in production computers later in that same 
decade; every general-purpose computer built today, from servers to low-power 
embedded processors, includes caches. 

In this section, we begin by looking at a very simple cache in which the processor 
requests are each one word and the blocks also consist of a single word. (Readers 
already familiar with cache basics may want to skip to Section 7.3 on page 492.) 



Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language,
Third College Edition (1988)
[Figure 7.4 panels: a. Before the reference to $X_n$; b. After the reference to $X_n$]

FIGURE 7.4 The cache just before and just after a reference to a word $X_n$ that is not
initially in the cache. This reference causes a miss that forces the cache to fetch $X_n$ from memory and
insert it into the cache.



direct-mapped cache A cache 
structure in which each memory 
location is mapped to exactly 
one location in the cache. 



Figure 7.4 shows such a simple cache, before and after requesting a data item that is
not initially in the cache. Before the request, the cache contains a collection of recent
references $X_1, X_2, \ldots, X_{n-1}$, and the processor requests a word $X_n$ that is not in the
cache. This request results in a miss, and the word $X_n$ is brought from memory into
cache.

In looking at the scenario in Figure 7.4, there are two questions to 
answer: How do we know if a data item is in the cache? Moreover, if it is, how do 
we find it? The answers to these two questions are related. If each word can go in 
exactly one place in the cache, then it is straightforward to find the word if it is in 
the cache. The simplest way to assign a location in the cache for each word in 
memory is to assign the cache location based on the address of the word in mem- 
ory. This cache structure is called direct mapped, since each memory location is 
mapped directly to exactly one location in the cache. The typical mapping 
between addresses and cache locations for a direct-mapped cache is usually sim- 
ple. For example, almost all direct-mapped caches use the mapping 



(Block address) modulo (Number of cache blocks in the cache) 



This mapping is attractive because if the number of entries in the cache is a power
of two, then modulo can be computed simply by using the low-order $\log_2$(cache
size in blocks) bits of the address; hence the cache may be accessed directly with
the low-order bits. For example, Figure 7.5 shows how the memory addresses
between $1_{ten}$ ($00001_{two}$) and $29_{ten}$ ($11101_{two}$) map to locations $1_{ten}$ ($001_{two}$) and
$5_{ten}$ ($101_{two}$) in a direct-mapped cache of eight words.

[Figure 7.5 labels: Cache; Memory, with addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, 11101]

FIGURE 7.5 A direct-mapped cache with eight entries showing the addresses of memory
words between 0 and 31 that map to the same cache locations. Because there are eight words in
the cache, an address X maps to the cache word X modulo 8. That is, the low-order $\log_2(8) = 3$ bits are used as
the cache index. Thus, addresses $00001_{two}$, $01001_{two}$, $10001_{two}$, and $11001_{two}$ all map to entry $001_{two}$ of the
cache, while addresses $00101_{two}$, $01101_{two}$, $10101_{two}$, and $11101_{two}$ all map to entry $101_{two}$ of the cache.
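
The modulo mapping can be sketched in a few lines. The example below is a minimal illustration, assuming an eight-entry cache with one-word blocks as in Figure 7.5. The function names are hypothetical. It computes the index both with the modulo operator and by masking the low-order bits, which agree whenever the number of blocks is a power of two.

```python
NUM_BLOCKS = 8  # eight-entry direct-mapped cache, as in Figure 7.5


def cache_index(block_address, num_blocks=NUM_BLOCKS):
    """Return the cache entry a block address maps to (address modulo size)."""
    return block_address % num_blocks


def cache_index_bits(block_address, num_blocks=NUM_BLOCKS):
    """Same mapping, taken from the low-order log2(num_blocks) address bits.

    Masking matches the modulo only when num_blocks is a power of two.
    """
    if num_blocks <= 0 or num_blocks & (num_blocks - 1):
        raise ValueError("num_blocks must be a power of two")
    return block_address & (num_blocks - 1)


for address in (1, 9, 17, 25, 5, 13, 21, 29):
    assert cache_index(address) == cache_index_bits(address)
    print(f"{address:05b} -> entry {cache_index(address):03b}")
```

Addresses 1, 9, 17, and 25 land in entry 001, and addresses 5, 13, 21, and 29 land in entry 101, matching the figure.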

Because each cache location can contain the contents of a number of different 
memory locations, how do we know whether the data in the cache corresponds to 
a requested word? That is, how do we know whether a requested word is in the 
cache or not? We answer this question by adding a set of tags to the cache. The 
tags contain the address information required to identify whether a word in the 
cache corresponds to the requested word. The tag needs only to contain the upper 
portion of the address, corresponding to the bits that are not used as an index into 
the cache. For example, in Figure 7.5 we need only have the upper 2 of the 5 
address bits in the tag, since the lower 3-bit index field of the address selects the 
block. We exclude the index bits because they are redundant: by definition,
every address that maps to a given cache block has that block's number in its index field.
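
A short sketch may make the tag and index split concrete. It assumes the 5-bit word addresses and 3-bit index of Figure 7.5. The helper names `split_address` and `join_address` are hypothetical.

```python
INDEX_BITS = 3  # eight cache blocks -> 3-bit index, as in Figure 7.5


def split_address(address, index_bits=INDEX_BITS):
    """Split a word address into (tag, index)."""
    index = address & ((1 << index_bits) - 1)  # low-order bits select the block
    tag = address >> index_bits                # remaining upper bits form the tag
    return tag, index


def join_address(tag, index, index_bits=INDEX_BITS):
    """Rebuild the word address from a stored tag and the block's index."""
    return (tag << index_bits) | index


tag, index = split_address(0b10010)  # address 18
print(f"tag={tag:02b} index={index:03b}")
```

Because the index is implied by where the entry sits in the cache, storing the 2-bit tag alone is enough to rebuild the full address.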

We also need a way to recognize that a cache block does not have valid
information. For instance, when a processor starts up, the cache does not have good
data, and the tag fields will be meaningless. Even after executing many
instructions, some of the cache entries may still be empty, as in Figure 7.4. Thus, we need
to know that the tag should be ignored for such entries. The most common
method is to add a valid bit to indicate whether an entry contains a valid address.
If the bit is not set, there cannot be a match for this block.

tag A field in a table used for a memory hierarchy that contains
the address information required to identify whether the associated
block in the hierarchy corresponds to a requested word.

valid bit A field in the tables of a memory hierarchy that indicates
that the associated block in the hierarchy contains valid data.

For the rest of this section, we will focus on explaining how reads work in a 
cache and how the cache control works for reads. In general, handling reads is a 
little simpler than handling writes, since reads do not have to change the contents 
of the cache. After seeing the basics of how reads work and how cache misses can 
be handled, we'll examine the cache designs for real computers and detail how 
these caches handle writes. 



Accessing a Cache 

Figure 7.6 shows the contents of an eight-word direct-mapped cache as it 
responds to a series of requests from the processor. Since there are eight blocks in 
the cache, the low-order 3 bits of an address give the block number. Here is the 
action for each reference: 



Decimal address   Binary address   Hit or miss   Assigned cache block
of reference      of reference     in cache      (where found or placed)
22                $10110_{two}$    miss (7.6b)   ($10110_{two}$ mod 8) = $110_{two}$
26                $11010_{two}$    miss (7.6c)   ($11010_{two}$ mod 8) = $010_{two}$
22                $10110_{two}$    hit           ($10110_{two}$ mod 8) = $110_{two}$
26                $11010_{two}$    hit           ($11010_{two}$ mod 8) = $010_{two}$
16                $10000_{two}$    miss (7.6d)   ($10000_{two}$ mod 8) = $000_{two}$
3                 $00011_{two}$    miss (7.6e)   ($00011_{two}$ mod 8) = $011_{two}$
16                $10000_{two}$    hit           ($10000_{two}$ mod 8) = $000_{two}$
18                $10010_{two}$    miss (7.6f)   ($10010_{two}$ mod 8) = $010_{two}$



When the word at address 18 ($10010_{two}$) is brought into cache block 2
($010_{two}$), the word at address 26 ($11010_{two}$), which was in cache block 2
($010_{two}$), must be replaced by the newly requested data. This behavior allows a
cache to take advantage of temporal locality: recently accessed words replace 
less recently referenced words. This situation is directly analogous to needing a 
book from the shelves and having no more space on your desk — some book 
already on your desk must be returned to the shelves. In a direct-mapped cache, 
there is only one place to put the newly requested item and hence only one 
choice of what to replace. 
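
The reference sequence above can be replayed with a small simulation. This is a minimal sketch, assuming one-word blocks and tracking only the valid bits and tags (no data). The class and method names are hypothetical.

```python
class DirectMappedCache:
    """Direct-mapped cache model: each entry stores a valid bit and a tag."""

    def __init__(self, num_blocks=8):
        if num_blocks <= 0 or num_blocks & (num_blocks - 1):
            raise ValueError("num_blocks must be a power of two")
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks  # all entries start out invalid
        self.tags = [0] * num_blocks

    def access(self, address):
        """Return True on a hit; on a miss, install the block and return False."""
        index = address % self.num_blocks
        tag = address // self.num_blocks
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: the only candidate entry is overwritten with the new block.
        self.valid[index] = True
        self.tags[index] = tag
        return False


cache = DirectMappedCache(num_blocks=8)
references = [22, 26, 22, 26, 16, 3, 16, 18]
results = ["hit" if cache.access(address) else "miss" for address in references]
print(list(zip(references, results)))
```

The final reference to address 18 misses and overwrites the entry that held address 26, so a later reference to 26 would miss again.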
FIGURE 7.6 The cache contents are shown after each reference request that misses, with the index and tag fields shown in
binary. The cache is initially empty, with all valid bits (V entry in cache) turned off (N). The processor requests the following addresses: $10110_{two}$
(miss), $11010_{two}$ (miss), $10110_{two}$ (hit), $11010_{two}$ (hit), $10000_{two}$ (miss), $00011_{two}$ (miss), $10000_{two}$ (hit), and $10010_{two}$ (miss). The figures show the
cache contents after each miss in the sequence has been handled. When address $10010_{two}$ (18) is referenced, the entry for address $11010_{two}$ (26) must
be replaced, and a reference to $11010_{two}$ will cause a subsequent miss. The tag field will contain only the upper portion of the address. The full address
of a word contained in cache block $i$ with tag field $j$ for this cache is $j \times 8 + i$, or equivalently the concatenation of the tag field $j$ and the index $i$. For
example, in cache f above, index 010 has tag 10 and corresponds to address 10010.
We know where to look in the cache for each possible address: the low-order
bits of an address can be used to find the unique cache entry to which the address
could map. Figure 7.7 shows how a referenced address is divided into

■ a cache index, which is used to select the block

■ a tag field, which is used to compare with the value of the tag field of the
cache



[Figure 7.7 label: Address (showing bit positions)]

FIGURE 7.7 For this cache, the lower portion of the address is used to select a cache
entry consisting of a data word and a tag. The tag from the cache is compared against the upper
portion of the address to determine whether the entry in the cache corresponds to the requested address.
Because the cache has $2^{10}$ (or 1024) words and a block size of 1 word, 10 bits are used to index the cache,
leaving $32 - 10 - 2 = 20$ bits to be compared against the tag. If the tag and upper 20 bits of the address are
equal and the valid bit is on, then the request hits in the cache, and the word is supplied to the processor.
Otherwise, a miss occurs.

The index of a cache block, together with the tag contents of that block, uniquely
specifies the memory address of the word contained in the cache block. Because 
the index field is used as an address to access the cache and because an n-bit field 
has 2^n values, the total number of entries in a direct-mapped cache must be a
power of two. In the MIPS architecture, since words are aligned to multiples of 4 
bytes, the least significant 2 bits of every address specify a byte within a word and 
hence are ignored when selecting the word in the block. 
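
As a minimal sketch of this address division, the following Python splits a 32-bit byte address into tag, index, and byte offset for a direct-mapped cache with 1024 one-word blocks, and rebuilds the address from the index and tag. The function names and default parameters are illustrative assumptions, not part of the text.

```python
def split_address(addr, index_bits=10, byte_offset_bits=2):
    """Split a byte address into (tag, index, byte_offset).

    Assumes a direct-mapped cache with one-word blocks, so the address
    holds only a byte offset, an index, and a tag.
    """
    byte_offset = addr & ((1 << byte_offset_bits) - 1)
    index = (addr >> byte_offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (byte_offset_bits + index_bits)
    return tag, index, byte_offset


def join_address(tag, index, byte_offset=0, index_bits=10, byte_offset_bits=2):
    """Rebuild the byte address from the tag stored in a block and its index."""
    return (tag << (index_bits + byte_offset_bits)) | (index << byte_offset_bits) | byte_offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), offset)
print(hex(join_address(tag, index, offset)))
```

The round trip through `join_address` illustrates why the index and tag together uniquely identify the memory word held in a block.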

The total number of bits needed for a cache is a function of the cache size and 
the address size because the cache includes both the storage for the data and the 
tags. The size of the block above was one word, but normally it is several. Assuming 
the 32-bit byte address, a direct-mapped cache of size 2^n blocks with 2^m-word
(2^{m+2}-byte) blocks will require a tag field whose size is 32 − (n + m + 2) bits
because n bits are used for the index, m bits are used for the word within the block,
and 2 bits are used for the byte part of the address. The total number of bits in a
direct-mapped cache is 2^n × (block size + tag size + valid field size). Since the block
size is 2^m words (2^{m+5} bits) and the address size is 32 bits, the number of bits in
such a cache is 2^n × (2^m × 32 + (32 − n − m − 2) + 1) = 2^n × (2^m × 32 + 31 − n − m).
However, the naming convention is to exclude the size of the tag and valid field
and to count only the size of the data.



EXAMPLE

Bits in a Cache

How many total bits are required for a direct-mapped cache with 16 KB of
data and 4-word blocks, assuming a 32-bit address?

ANSWER

We know that 16 KB is 4K words, which is 2^{12} words, and, with a block size of
4 words (2^2), 2^{10} blocks. Each block has 4 × 32 or 128 bits of data plus a tag,
which is 32 − 10 − 2 − 2 bits, plus a valid bit. Thus, the total cache size is

2^{10} × (128 + (32 − 10 − 2 − 2) + 1) = 2^{10} × 147 = 147 Kbits

or 18.4 KB for a 16 KB cache. For this cache, the total number of bits in the
cache is about 1.15 times as many as needed just for the storage of the data.
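
The same count can be checked with a short Python sketch. It takes the data portion of each block as 2^m × 32 bits, the tag as 32 − (n + m + 2) bits, and one valid bit; the function name `cache_total_bits` is an assumption made for illustration.

```python
def cache_total_bits(n, m, addr_bits=32):
    """Total storage bits in a direct-mapped cache.

    n: log2 of the number of blocks
    m: log2 of the number of words per block
    """
    data_bits = (2 ** m) * 32              # data portion of one block
    tag_bits = addr_bits - (n + m + 2)     # index, word-in-block, and byte bits removed
    valid_bits = 1
    return (2 ** n) * (data_bits + tag_bits + valid_bits)


total = cache_total_bits(n=10, m=2)        # 16 KB of data, 4-word blocks
data_only = (2 ** 10) * (2 ** 2) * 32
print(total // 1024, "Kbits")              # 147 Kbits
print(round(total / data_only, 2))         # ratio of total bits to data bits
```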



EXAMPLE

Mapping an Address to a Multiword Cache Block

Consider a cache with 64 blocks and a block size of 16 bytes. What block
number does byte address 1200 map to?

ANSWER

We saw the formula on page 474. The block is given by

(Block address) modulo (Number of cache blocks)

where the address of the block is

⌊Byte address / Bytes per block⌋

Notice that this block address is the block containing all addresses between

⌊Byte address / Bytes per block⌋ × Bytes per block

and

⌊Byte address / Bytes per block⌋ × Bytes per block + (Bytes per block − 1)

Thus, with 16 bytes per block, byte address 1200 is block address

⌊1200 / 16⌋ = 75

which maps to cache block number (75 modulo 64) = 11. In fact, this block
maps all addresses between 1200 and 1215.
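
The mapping in this example can be written as a few lines of Python. This is only a sketch; the function name and the returned tuple layout are illustrative choices.

```python
def map_byte_address(byte_addr, bytes_per_block, num_blocks):
    """Return (block address, cache block number, first byte, last byte)."""
    block_addr = byte_addr // bytes_per_block       # floor of the division
    cache_block = block_addr % num_blocks           # direct-mapped placement
    first_byte = block_addr * bytes_per_block
    last_byte = first_byte + bytes_per_block - 1
    return block_addr, cache_block, first_byte, last_byte


print(map_byte_address(1200, 16, 64))   # (75, 11, 1200, 1215)
```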



Larger blocks exploit spatial locality to lower miss rates. As Figure 7.8 shows, 
increasing the block size usually decreases the miss rate. The miss rate may go up 
eventually if the block size becomes a significant fraction of the cache size because 
the number of blocks that can be held in the cache will become small, and there 
will be a great deal of competition for those blocks. As a result, a block will be 
bumped out of the cache before many of its words are accessed. Stated alterna- 
tively, spatial locality among the words in a block decreases with a very large 
block; consequently, the benefits in the miss rate become smaller. 

FIGURE 7.8 Miss rate versus block size. Note that miss rate actually goes up if the block size is too
large relative to the cache size. Each line represents a cache of different size. (This figure is independent of 
associativity, discussed soon.) Unfortunately, SPEC2000 traces would take too long if block size were 
included, so these data are based on SPEC92. 



A more serious problem associated with just increasing the block size is that the 
cost of a miss increases. The miss penalty is determined by the time required to 
fetch the block from the next lower level of the hierarchy and load it into the 
cache. The time to fetch the block has two parts: the latency to the first word and 
the transfer time for the rest of the block. Clearly, unless we change the memory 
system, the transfer time — and hence the miss penalty — will increase as the block 
size grows. Furthermore, the improvement in the miss rate starts to decrease as 
the blocks become larger. The result is that the increase in the miss penalty over- 
whelms the decrease in the miss rate for large blocks, and cache performance thus 
decreases. Of course, if we design the memory to transfer larger blocks more effi- 
ciently, we can increase the block size and obtain further improvements in cache 
performance. We discuss this topic in the next section. 



Elaboration: The major disadvantage of increasing the block size is that the cache 
miss penalty increases. Although it is hard to do anything about the latency component 
of the miss penalty, we may be able to hide some of the transfer time so that the miss 
penalty is effectively smaller. The simplest method for doing this, called early restart, is 
simply to resume execution as soon as the requested word of the block is returned, 
rather than wait for the entire block. Many processors use this technique for instruction 
access, where it works best. Instruction accesses are largely sequential, so if the memory
system can deliver a word every clock cycle, the processor may be able to restart
operation when the requested word is returned, with the memory system delivering new 
instruction words just in time. This technique is usually less effective for data caches 
because it is likely that the words will be requested from the block in a less predictable 
way, and the probability that the processor will need another word from a different 
cache block before the transfer completes is high. If the processor cannot access the 
data cache because a transfer is ongoing, then it must stall. 

An even more sophisticated scheme is to organize the memory so that the 
requested word is transferred from the memory to the cache first. The remainder of the 
block is then transferred, starting with the address after the requested word and wrap- 
ping around to the beginning of the block. This technique, called requested word first, or 
critical word first, can be slightly faster than early restart, but it is limited by the same 
properties that limit early restart. 
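
A small Python sketch may make the two transfer orders concrete. It assumes, purely for illustration, that early restart transfers the block starting from word 0 and that one word arrives per transfer slot; the function names are hypothetical.

```python
def critical_word_first_order(requested_word, words_per_block):
    """Order in which words arrive: requested word first, then wrap around."""
    return [(requested_word + i) % words_per_block for i in range(words_per_block)]


def transfers_until_restart(requested_word, critical_first):
    """Word transfers the processor waits for before it can resume.

    With early restart the block is assumed to arrive in order from word 0,
    so the processor waits for requested_word + 1 transfers; with critical
    word first the requested word is the first transfer.
    """
    return 1 if critical_first else requested_word + 1


print(critical_word_first_order(2, 4))                 # [2, 3, 0, 1]
print(transfers_until_restart(2, critical_first=False))
print(transfers_until_restart(2, critical_first=True))
```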



cache miss A request for data 
from the cache that cannot be 
filled because the data is not 
present in the cache. 



Handling Cache Misses 

Before we look at the cache of a real system, let's see how the control unit deals 
with cache misses. The control unit must detect a miss and process the miss by 
fetching the requested data from memory (or, as we shall see, a lower-level cache). 
If the cache reports a hit, the computer continues using the data as if nothing had 
happened. Consequently, we can use the same basic control that we developed in 
Chapter 5 and enhanced to accommodate pipelining in Chapter 6. The memories 
in the datapath in Chapters 5 and 6 are simply replaced by caches. 

Modifying the control of a processor to handle a hit is trivial; misses, however, 
require some extra work. The cache miss handling is done with the processor con- 
trol unit and with a separate controller that initiates the memory access and refills 
the cache. The processing of a cache miss creates a stall, similar to the pipeline stalls 
discussed in Chapter 6, as opposed to an interrupt, which would require saving the 
state of all registers. For a cache miss, we can stall the entire processor, essentially 
freezing the contents of the temporary and programmer-visible registers, while we 
wait for memory. In contrast, pipeline stalls, discussed in Chapter 6, are more com- 
plex because we must continue executing some instructions while we stall others. 

Let's look a little more closely at how instruction misses are handled for either 
the multicycle or pipelined datapath; the same approach can be easily extended to 
handle data misses. If an instruction access results in a miss, then the content of 
the Instruction register is invalid. To get the proper instruction into the cache, we 
must be able to instruct the lower level in the memory hierarchy to perform a 
read. Since the program counter is incremented in the first clock cycle of execu- 
tion in both the pipelined and multicycle processors, the address of the instruc- 
tion that generates an instruction cache miss is equal to the value of the program 
counter minus 4. Once we have the address, we need to instruct the main memory 
to perform a read. We wait for the memory to respond (since the access will take
multiple cycles), and then write the words into the cache. 

We can now define the steps to be taken on an instruction cache miss: 

1. Send the original PC value (current PC − 4) to the memory.

2. Instruct main memory to perform a read and wait for the memory to com- 
plete its access. 

3. Write the cache entry, putting the data from memory in the data portion of 
the entry, writing the upper bits of the address (from the ALU) into the tag 
field, and turning the valid bit on. 

4. Restart the instruction execution at the first step, which will refetch the 
instruction, this time finding it in the cache. 

The control of the cache on a data access is essentially identical: on a miss, we 
simply stall the processor until the memory responds with the data. 
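
The miss-handling steps can be sketched as a tiny direct-mapped cache model in Python. It tracks only valid bits and tags (the memory read itself is not modeled), and it replays the reference sequence of Figure 7.6 written as decimal word addresses (22, 26, 22, 26, 16, 3, 16, 18) on an eight-block cache. The class and method names are assumptions made for this illustration.

```python
class DirectMappedCache:
    """Direct-mapped cache with one-word blocks; tracks valid bits and tags only."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks
        self.tags = [0] * num_blocks

    def access(self, word_addr):
        """Return True on a hit; on a miss, refill the entry and return False."""
        index = word_addr % self.num_blocks
        tag = word_addr // self.num_blocks
        if self.valid[index] and self.tags[index] == tag:
            return True
        # Miss: the processor would stall here while memory is read.
        # Then the entry is written: tag updated and valid bit turned on.
        self.valid[index] = True
        self.tags[index] = tag
        return False


cache = DirectMappedCache(8)
refs = [22, 26, 22, 26, 16, 3, 16, 18]
results = ["hit" if cache.access(a) else "miss" for a in refs]
print(results)
```

In this model the final reference to 18 replaces the entry holding 26, so a later reference to 26 would miss again.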

Handling Writes 

Writes work somewhat differently. Suppose on a store instruction, we wrote the 
data into only the data cache (without changing main memory); then, after the 
write into the cache, memory would have a different value from that in the cache. 
In such a case, the cache and memory are said to be inconsistent. The simplest way 
to keep the main memory and the cache consistent is to always write the data into 
both the memory and the cache. This scheme is called write-through. 

The other key aspect of writes is what occurs on a write miss. We first fetch the 
words of the block from memory. After the block is fetched and placed into the 
cache, we can overwrite the word that caused the miss into the cache block. We 
also write the word to main memory using the full address. 

Although this design handles writes very simply, it would not provide very good 
performance. With a write-through scheme, every write causes the data to be written 
to main memory. These writes will take a long time, likely at least 100 processor clock 
cycles, and could slow down the processor considerably. For the SPEC2000 integer 
benchmarks, for example, 10% of the instructions are stores. If the CPI without 
cache misses was 1.0, spending 100 extra cycles on every write would lead to a CPI of 
1.0 + 100 × 10% = 11, reducing performance by more than a factor of 10.
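
The arithmetic behind this estimate is simple enough to express directly. The sketch below assumes every store stalls the processor for the full write time; the function name is illustrative.

```python
def cpi_with_write_stalls(base_cpi, store_fraction, write_cycles):
    """CPI when every store stalls for write_cycles (no write buffer)."""
    return base_cpi + store_fraction * write_cycles


print(cpi_with_write_stalls(1.0, 0.10, 100))   # about 11
```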

One solution to this problem is to use a write buffer. A write buffer stores the 
data while it is waiting to be written to memory. After writing the data into the 
cache and into the write buffer, the processor can continue execution. When a 
write to main memory completes, the entry in the write buffer is freed. If the write 
buffer is full when the processor reaches a write, the processor must stall until 
there is an empty position in the write buffer. Of course, if the rate at which the
memory can complete writes is less than the rate at which the processor is generating
writes, no amount of buffering can help because writes are being generated
faster than the memory system can accept them.

write-through A scheme in which writes always update both the cache and the memory,
ensuring that data is always consistent between the two.

write buffer A queue that holds data while the data are waiting to be written to memory.

write-back A scheme that handles writes by updating values only to the block in the cache,
then writing the modified block to the lower level of the hierarchy when the block is replaced.

The rate at which writes are generated may also be less than the rate at which 
the memory can accept them, and yet stalls may still occur. This can happen when 
the writes occur in bursts. To reduce the occurrence of such stalls, processors usu- 
ally increase the depth of the write buffer beyond a single entry. 
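
To illustrate why bursts matter, here is a rough Python model of a write buffer. It is a sketch under simplifying assumptions that are not from the text: each store occupies one processor cycle, memory completes buffered writes one at a time in FIFO order, and `gaps` gives the number of non-store cycles before each store.

```python
from collections import deque


def simulate_write_buffer(gaps, depth, mem_cycles):
    """Return the total stall cycles caused by a full write buffer."""
    time = 0
    stalls = 0
    completions = deque()   # completion times of writes still in the buffer
    last_done = 0           # time at which memory finishes the newest queued write
    for gap in gaps:
        time += gap
        # Free entries whose memory write has already completed.
        while completions and completions[0] <= time:
            completions.popleft()
        if len(completions) == depth:
            # Buffer full: stall until the oldest write completes.
            wait = completions[0] - time
            stalls += wait
            time += wait
            completions.popleft()
        start = max(time, last_done)      # memory handles one write at a time
        last_done = start + mem_cycles
        completions.append(last_done)
        time += 1                         # the store itself takes one cycle
    return stalls


burst = [0, 0, 0, 0]                      # four back-to-back stores
print(simulate_write_buffer(burst, depth=1, mem_cycles=10))
print(simulate_write_buffer(burst, depth=4, mem_cycles=10))
```

Under these assumptions, a one-entry buffer stalls on every store of the burst after the first, while a four-entry buffer absorbs the whole burst.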

The alternative to a write-through scheme is a scheme called write-back. In a 
write-back scheme, when a write occurs, the new value is written only to the block 
in the cache. The modified block is written to the lower level of the hierarchy 
when it is replaced. Write-back schemes can improve performance, especially 
when processors can generate writes as fast or faster than the writes can be han- 
dled by main memory; a write-back scheme is, however, more complex to imple- 
ment than write-through. 
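
The difference in memory traffic between the two schemes can be sketched by counting writes to memory for a stream of stores. This Python model is an illustrative simplification: it assumes a direct-mapped cache, treats every access as a store (so every resident block is dirty), and ignores dirty blocks still in the cache at the end.

```python
def memory_writes(store_block_addrs, num_blocks):
    """Return (write-through writes, write-back writes) to memory."""
    write_through = len(store_block_addrs)   # every store goes to memory
    resident = {}                            # cache index -> dirty block address
    write_back = 0
    for block in store_block_addrs:
        index = block % num_blocks
        if index in resident and resident[index] != block:
            write_back += 1                  # dirty block replaced: write it back
        resident[index] = block
    return write_through, write_back


# Ten stores to block 5, then one store to block 13, which maps to the
# same entry of an eight-block cache and forces block 5 out.
print(memory_writes([5] * 10 + [13], num_blocks=8))   # (11, 1)
```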

In the rest of this section, we describe caches from real processors, and we 
examine how they handle both reads and writes. In Section 7.5, we will describe 
the handling of writes in more detail. 



Elaboration: Writes introduce several complications into caches that are not present 
for reads. Here we discuss two of them: the policy on write misses and efficient imple- 
mentation of writes in write-back caches. 

Consider a miss in a write-through cache. The strategy followed in most write-through
cache designs, called fetch-on-miss, fetch-on-write, or sometimes allocate-on-miss,
allocates a cache block to the address that missed and fetches the rest of the
block into the cache before writing the data and continuing execution. Alternatively, we
could either allocate the block in the cache but not fetch the data (called no-fetch-on-write),
or even not allocate the block (called no-allocate-on-write). Another name for
these strategies that do not place the written data into the cache is write-around, since 
the data is written around the cache to get to memory. The motivation for these 
schemes is the observation that sometimes programs write entire blocks of data 
before reading them. In such cases, the fetch associated with the initial write miss may 
be unnecessary. There are a number of subtle issues involved in implementing these 
schemes in multiword blocks, including complicating the handling of write hits by requir- 
ing mechanisms similar to those used for write-back caches. 

Actually implementing stores efficiently in a cache that uses a write-back strategy is 
more complex than in a write-through cache. In a write-back cache, we must write the 
block back to memory if the data in the cache is dirty and we have a cache miss. If we 
simply overwrote the block on a store instruction before we knew whether the store had 
hit in the cache (as we could for a write-through cache), we would destroy the contents 
of the block, which is not backed up in memory. A write-through cache can write the 
data into the cache and read the tag; if the tag mismatches, then a miss occurs. 
Because the cache is write-through, the overwriting of the block in the cache is not cat- 
astrophic since memory has the correct value. 

In a write-back cache, because we cannot overwrite the block, stores either require
two cycles (a cycle to check for a hit followed by a cycle to actually perform the write) or 
require an extra buffer, called a store buffer, to hold that data — effectively allowing the 
store to take only one cycle by pipelining it. When a store buffer is used, the processor 
does the cache lookup and places the data in the store buffer during the normal cache 
access cycle. Assuming a cache hit, the new data is written from the store buffer into 
the cache on the next unused cache access cycle. 

By comparison, in a write-through cache, writes can always be done in one cycle. 
There are some extra complications with multiword blocks, however, since we cannot 
simply overwrite the tag when we write the data. Instead, we read the tag and write the 
data portion of the selected block. If the tag matches the address of the block being 
written, the processor can continue normally, since the correct block has been updated. 
If the tag does not match, the processor generates a write miss to fetch the rest of the 
block corresponding to that address. Because it is always safe to overwrite the data, 
write hits still take one cycle. 

Many write-back caches also include write buffers that are used to reduce the miss 
penalty when a miss replaces a dirty block. In such a case, the dirty block is moved to 
a write-back buffer associated with the cache while the requested block is read from 
memory. The write-back buffer is later written back to memory. Assuming another miss 
does not occur immediately, this technique halves the miss penalty when a dirty block 
must be replaced. 

An Example Cache: The Intrinsity FastMATH processor 

The Intrinsity FastMATH is a fast embedded microprocessor that uses the MIPS 
architecture and a simple cache implementation. Near the end of the chapter, we 
will examine the more complex cache design of the Intel Pentium 4, but we start 
with this simple, yet real, example for pedagogical reasons. Figure 7.9 shows the 
organization of the Intrinsity FastMATH data cache. 

This processor has a 12-stage pipeline, similar to that discussed in Chapter 6.
When operating at peak speed, the processor can request both an instruction 
word and a data word on every clock. To satisfy the demands of the pipeline with- 
out stalling, separate instruction and data caches are used. Each cache is 16 KB, or 
4K words, with 16-word blocks. 

Read requests for the cache are straightforward. Because there are separate data 
and instruction caches, separate control signals will be needed to read and write 
each cache. (Remember that we need to update the instruction cache when a miss 
occurs.) Thus, the steps for a read request to either cache are as follows: 

1. Send the address to the appropriate cache. The address comes either from 
the PC (for an instruction) or from the ALU (for data). 

2. If the cache signals hit, the requested word is available on the data lines.
Since there are 16 words in the desired block, we need to select the right
one. A block index field is used to control the multiplexor (shown at the
bottom of the figure), which selects the requested word from the 16 words
in the indexed block.

3. If the cache signals miss, we send the address to the main memory. When
the memory returns with the data, we write it into the cache and then read
it to fulfill the request.

FIGURE 7.9 The 16 KB caches in the Intrinsity FastMATH each contain 256 blocks with 16 words per block. The tag field is 18 bits
wide and the index field is 8 bits wide, while a 4-bit field (bits 5-2) is used to index the block and select the word from the block using a 16-to-1
multiplexor. In practice, to eliminate the multiplexor, caches use a separate large RAM for the data and a smaller RAM for the tags, with the block offset
supplying the extra address bits for the large data RAM. In this case, the large RAM is 32 bits wide and must have 16 times as many words as blocks in the cache.
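
The address fields used by such a cache (18-bit tag, 8-bit index, 4-bit block offset, 2-bit byte offset) can be extracted with shifts and masks. The following Python is a sketch only; the function name is hypothetical and the bit ranges in the comments follow from the field widths given for this cache.

```python
def fastmath_fields(addr):
    """Split a 32-bit byte address into (tag, index, block_offset, byte_offset)."""
    byte_offset = addr & 0x3             # bits 1-0: byte within the word
    block_offset = (addr >> 2) & 0xF     # bits 5-2: word within the 16-word block
    index = (addr >> 6) & 0xFF           # bits 13-6: one of 256 blocks
    tag = (addr >> 14) & 0x3FFFF         # bits 31-14: compared with the stored tag
    return tag, index, block_offset, byte_offset


tag, index, block_offset, byte_offset = fastmath_fields(0x12345678)
print(hex(tag), hex(index), block_offset, byte_offset)
```

On a hit, `block_offset` plays the role of the select input of the 16-to-1 multiplexor.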

For writes, the Intrinsity FastMATH offers both write-through and write-back, 
leaving it up to the operating system to decide which strategy to use for an appli- 
cation. It has a one-entry write buffer. 

FIGURE 7.10 Approximate instruction and data miss rates for the Intrinsity FastMATH
processor for SPEC2000 benchmarks. The combined miss rate is the effective miss rate seen for the 
combination of the 16 KB instruction cache and 16 KB data cache. It is obtained by weighting the instruc- 
tion and data individual miss rates by the frequency of instruction and data references. 

What cache miss rates are attained with a cache structure like that used by the 
Intrinsity FastMATH? Figure 7.10 shows the miss rates for the instruction and 
data caches for the SPEC2000 integer benchmarks. The combined miss rate is the 
effective miss rate per reference for each program after accounting for the differ- 
ing frequency of instruction and data accesses. 
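
The weighting can be sketched as follows. Each instruction is assumed to make one instruction reference plus some average number of data references; the miss rates and reference frequency in the example call are hypothetical values chosen for illustration, not the measured values from the figure.

```python
def combined_miss_rate(instr_miss_rate, data_miss_rate, data_refs_per_instr):
    """Effective miss rate per reference for split instruction and data caches."""
    instr_refs = 1.0                      # one instruction fetch per instruction
    data_refs = data_refs_per_instr
    misses = instr_refs * instr_miss_rate + data_refs * data_miss_rate
    return misses / (instr_refs + data_refs)


# Hypothetical inputs: 1% instruction miss rate, 10% data miss rate,
# and 0.25 data references per instruction.
print(combined_miss_rate(0.01, 0.10, 0.25))
```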

Although miss rate is an important characteristic of cache designs, the ultimate 
measure will be the effect of the memory system on program execution time; we'll 
see how miss rate and execution time are related shortly. 

Elaboration: A combined cache with a total size equal to the sum of the two split 
caches will usually have a better hit rate. This higher rate occurs because the combined 
cache does not rigidly divide the number of entries that may be used by instructions 
from those that may be used by data. Nonetheless, many processors use a split 
instruction and data cache to increase cache bandwidth. 

Here are miss rates for caches the size of those found in the Intrinsity FastMATH 
processor, and for a combined cache whose size is equal to the total of the two caches: 

■ Total cache size: 32 KB 

■ Split cache effective miss rate: 3.24% 

■ Combined cache miss rate: 3.18% 

The miss rate of the split cache is only slightly worse. 

The advantage of doubling the cache bandwidth, by supporting both an instruction 
and data access simultaneously, easily overcomes the disadvantage of a slightly 
increased miss rate. This observation is another reminder that we cannot use miss rate 
as the sole measure of cache performance, as Section 7.3 shows. 



split cache A scheme in which 
a level of the memory hierarchy 
is composed of two independent 
caches that operate in parallel 
with each other with one 
handling instructions and one 
handling data. 



Designing the Memory System to Support Caches 

Cache misses are satisfied from main memory, which is constructed from 
DRAMs. In Section 7.1, we saw that DRAMs are designed with the primary 
emphasis on density rather than access time. Although it is difficult to reduce the 
latency to fetch the first word from memory, we can reduce the miss penalty if we 
increase the bandwidth from the memory to the cache. This reduction allows
larger block sizes to be used while still maintaining a low miss penalty, similar to
that for a smaller block. 

The processor is typically connected to memory over a bus. The clock rate of 
the bus is usually much slower than the processor, by as much as a factor of 10. 
The speed of this bus affects the miss penalty. 

To understand the impact of different organizations of memory, let's define a 
set of hypothetical memory access times. Assume 

■ 1 memory bus clock cycle to send the address 

■ 15 memory bus clock cycles for each DRAM access initiated

■ 1 memory bus clock cycle to send a word of data 

If we have a cache block of four words and a one-word-wide bank of DRAMs,
the miss penalty would be 1 + 4 × 15 + 4 × 1 = 65 memory bus clock cycles. Thus,
the number of bytes transferred per bus clock cycle for a single miss would be

(4 × 4) / 65 = 0.25



Figure 7.11 shows three options for designing the memory system. The first 
option follows what we have been assuming: memory is one word wide, and all 
accesses are made sequentially. The second option increases the bandwidth to 
memory by widening the memory and the buses between the processor and mem- 
ory; this allows parallel access to all the words of the block. The third option 
increases the bandwidth by widening the memory but not the interconnection 
bus. Thus, we still pay a cost to transmit each word, but we can avoid paying the 
cost of the access latency more than once. Let's look at how much these other two 
options improve the 65-cycle miss penalty that we would see for the first option 
(Figure 7.11a). 

Increasing the width of the memory and the bus will increase the memory 
bandwidth proportionally, decreasing both the access time and transfer time portions
of the miss penalty. With a main memory width of two words, the miss penalty
drops from 65 memory bus clock cycles to 1 + 2 × 15 + 2 × 1 = 33 memory
bus clock cycles. With a four-word-wide memory, the miss penalty is just 17 
memory bus clock cycles. The bandwidth for a single miss is then 0.48 (almost 
twice as high) bytes per bus clock cycle for a memory that is two words wide, and 
0.94 bytes per bus clock cycle when the memory is four words wide (almost four 
times higher). The major costs of this enhancement are the wider bus and the 
potential increase in cache access time due to the multiplexor and control logic 
between the processor and cache. 

Instead of making the entire path between the memory and cache wider, the
memory chips can be organized in banks to read or write multiple words in one
access time rather than reading or writing a single word each time. Each bank
could be one word wide so that the width of the bus and the cache need not
change, but sending an address to several banks permits them all to read simultaneously.
This scheme, which is called interleaving, retains the advantage of incurring
the full memory latency only once. For example, with four banks, the time to
get a four-word block would consist of 1 cycle to transmit the address and read
request to the banks, 15 cycles for all four banks to access memory, and 4 cycles to
send the four words back to the cache. This yields a miss penalty of 1 + 1 × 15 + 4 × 1
= 20 memory bus clock cycles. This is an effective bandwidth per miss of 0.80
bytes per clock, or about three times the bandwidth for the one-word-wide memory
and bus. Banks are also valuable on writes. Each bank can write independently,
quadrupling the write bandwidth and leading to fewer stalls in a write-through
cache. As we will see, an alternative strategy for writes makes interleaving
even more attractive.

FIGURE 7.11 The primary method of achieving higher memory bandwidth is to increase the physical or logical width of the
memory system. In this figure, memory bandwidth is improved two ways. The simplest design, (a), uses a memory where all components are one
word wide; (b) shows a wider memory, bus, and cache; while (c) shows a narrow bus and cache with an interleaved memory. In (b), the logic between
the cache and processor consists of a multiplexor used on reads and control logic to update the appropriate words of the cache on writes.
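
The miss penalties for the three memory organizations can be reproduced with a short Python sketch that uses the hypothetical timings assumed in this section (1 bus cycle to send the address, 15 per DRAM access, 1 per word or group of words transferred). The function names are illustrative, and the interleaved case assumes the number of banks divides the block size evenly.

```python
ADDR_CYCLES = 1       # send the address
ACCESS_CYCLES = 15    # each DRAM access initiated
TRANSFER_CYCLES = 1   # each transfer over the bus


def penalty_wide(block_words, width_words):
    """Memory and bus are width_words wide; one access and one transfer per group."""
    accesses = block_words // width_words
    return ADDR_CYCLES + accesses * ACCESS_CYCLES + accesses * TRANSFER_CYCLES


def penalty_interleaved(block_words, banks):
    """One-word-wide bus; all banks access in parallel, words are sent one by one."""
    rounds = block_words // banks
    return ADDR_CYCLES + rounds * ACCESS_CYCLES + block_words * TRANSFER_CYCLES


def bytes_per_cycle(block_words, penalty):
    """Bandwidth for a single miss, with 4-byte words."""
    return block_words * 4 / penalty


for label, p in [("one word wide", penalty_wide(4, 1)),
                 ("two words wide", penalty_wide(4, 2)),
                 ("four words wide", penalty_wide(4, 4)),
                 ("four banks interleaved", penalty_interleaved(4, 4))]:
    print(label, p, round(bytes_per_cycle(4, p), 2))
```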

Elaboration: Memory chips are organized to produce a number of output bits, usu- 
ally 4 to 32, with 8 or 16 being the most popular in 2004. We describe the organization 
of a RAM as d x w, where d is the number of addressable locations (the depth) and w is 
the output (or width of each location). One path to improving the rate at which we trans- 
fer data from the memory to the caches is to take advantage of the structure of 
DRAMs. DRAMs are logically organized as rectangular arrays, and access time is 
divided into row access and column access. DRAMs buffer a row of bits inside the 
DRAM for column access. They also come with optional timing signals that allow 
repeated accesses to the buffer without a row access time. This capability, originally 
called page mode, has gone through a series of enhancements. In page mode, the 
buffer acts like an SRAM; by changing column address, random bits can be accessed in 
the buffer until the next row access. This capability changes the access time signifi- 
cantly, since the access time to bits in the row is much lower. Figure 7.12 shows how 
the density, cost, and access time of DRAMs have changed over the years.

The newest development is DDR SDRAMs (double data rate synchronous DRAMs). 
SDRAMs provide for a burst access to data from a series of sequential locations in the 
DRAM. An SDRAM is supplied with a starting address and a burst length. The data in 
the burst is transferred under control of a clock signal, which in 2004 can run at up to 



Year         Chip size    $ per MB    Total access time to    Column access time
introduced                            a new row/column        to existing row

1980         64 Kbit      $1500       250 ns                  150 ns
1983         256 Kbit     $500        185 ns                  100 ns
1985         1 Mbit       $200        135 ns                  40 ns
1989         4 Mbit       $50         110 ns                  40 ns
1992         16 Mbit      $15         90 ns                   30 ns
1996         64 Mbit      $10         60 ns                   12 ns
1998         128 Mbit     $4          60 ns                   10 ns
2000         256 Mbit     $1          55 ns                   7 ns
2002         512 Mbit     $0.25       50 ns                   5 ns
2004         1024 Mbit    $0.10       45 ns                   3 ns

FIGURE 7.12 DRAM size increased by multiples of four approximately once every three
years until 1996, and thereafter doubling approximately every two years. The improvements
in access time have been slower but continuous, and cost almost tracks density improvements,
although cost is often affected by other issues, such as availability and demand. The cost per megabyte is not
adjusted for inflation.

300 MHz. The two key advantages of SDRAMs are the use of a clock that eliminates
the need to synchronize and the elimination of the need to supply successive 
addresses in the burst. The DDR part of the name means data transfers on both the 
rising and falling edges of the clock, thereby getting twice as much bandwidth as you
might expect based on the clock rate and the data width. To deliver such high band- 
width, the internal DRAM is organized as interleaved memory banks. 
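
As a rough illustration of the doubling, the peak transfer rate of a DDR part is two transfers per clock cycle times the data width. The sketch below uses the 300 MHz clock mentioned above and a 16-bit (2-byte) chip width as an example; the function name is an assumption, and the result is a peak figure that ignores row access and other overheads.

```python
def ddr_peak_bandwidth(clock_hz, width_bytes):
    """Peak bytes per second with data transferred on both clock edges."""
    transfers_per_second = 2 * clock_hz
    return transfers_per_second * width_bytes


print(ddr_peak_bandwidth(300_000_000, 2))   # bytes per second
```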

The advantage of these optimizations is that they use the circuitry already largely on 
the DRAMs, adding little cost to the system while achieving a significant improvement 
in bandwidth. The internal architecture of DRAMs and how these optimizations are 
implemented are described in Section B.8 of Appendix B.



Summary 

We began the previous section by examining the simplest of caches: a direct-mapped 
cache with a one-word block. In such a cache, both hits and misses are simple, since a 
word can go in exactly one location and there is a separate tag for every word. To keep 
the cache and memory consistent, a write-through scheme can be used, so that every 
write into the cache also causes memory to be updated. The alternative to write- 
through is a write-back scheme that copies a block back to memory when it is 
replaced; we'll discuss this scheme further in upcoming sections. 

To take advantage of spatial locality, a cache must have a block size larger than 
one word. The use of a larger block decreases the miss rate and improves the effi- 
ciency of the cache by reducing the amount of tag storage relative to the amount 
of data storage in the cache. Although a larger block size decreases the miss rate, it 
can also increase the miss penalty. If the miss penalty increased linearly with the 
block size, larger blocks could easily lead to lower performance. To avoid this, the 
bandwidth of main memory is increased to transfer cache blocks more efficiently. 
The two common methods for doing this are making the memory wider and 
interleaving. In both cases, we reduce the time to fetch the block by minimizing 
the number of times we must start a new memory access to fetch a block, and, 
with a wider bus, we can also decrease the time needed to send the block from the 
memory to the cache. 



Check Yourself 

The speed of the memory system affects the designer's decision on the size of the 
cache block. Which of the following cache designer guidelines are generally valid? 

1. The shorter the memory latency, the smaller the cache block. 

2. The shorter the memory latency, the larger the cache block. 

3. The higher the memory bandwidth, the smaller the cache block. 

4. The higher the memory bandwidth, the larger the cache block. 



7.3 Measuring and Improving Cache Performance 



In this section, we begin by looking at how to measure and analyze cache perfor- 
mance; we then explore two different techniques for improving cache perfor- 
mance. One focuses on reducing the miss rate by reducing the probability that 
two different memory blocks will contend for the same cache location. The sec- 
ond technique reduces the miss penalty by adding an additional level to the hier- 
archy. This technique, called multilevel caching, first appeared in high-end 
computers selling for over $100,000 in 1990; since then it has become common on 
desktop computers selling for less than $1000! 

CPU time can be divided into the clock cycles that the CPU spends executing 
the program and the clock cycles that the CPU spends waiting for the memory 
system. Normally, we assume that the costs of cache accesses that are hits are part 
of the normal CPU execution cycles. Thus, 

CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time 

The memory-stall clock cycles come primarily from cache misses, and we make 
that assumption here. We also restrict the discussion to a simplified model of the 
memory system. In real processors, the stalls generated by reads and writes can be 
quite complex, and accurate performance prediction usually requires very 
detailed simulations of the processor and memory system. 

Memory-stall clock cycles can be defined as the sum of the stall cycles coming 
from reads plus those coming from writes: 

Memory-stall clock cycles = Read-stall cycles + Write-stall cycles 

The read-stall cycles can be defined in terms of the number of read accesses per 
program, the miss penalty in clock cycles for a read, and the read miss rate: 

$$\text{Read-stall cycles} = \frac{\text{Reads}}{\text{Program}} \times \text{Read miss rate} \times \text{Read miss penalty}$$

Writes are more complicated. For a write-through scheme, we have two sources of 
stalls: write misses, which usually require that we fetch the block before continu- 
ing the write (see the Elaboration on page 484 for more details on dealing with 
writes), and write buffer stalls, which occur when the write buffer is full when a 
write occurs. Thus, the cycles stalled for writes equal the sum of these two: 

$$\text{Write-stall cycles} = \left(\frac{\text{Writes}}{\text{Program}} \times \text{Write miss rate} \times \text{Write miss penalty}\right) + \text{Write buffer stalls}$$

Because the write buffer stalls depend on the timing of writes, and not just the 
frequency, it is not possible to give a simple equation to compute such stalls. For- 
tunately, in systems with a reasonable write buffer depth (e.g., four or more 
words) and a memory capable of accepting writes at a rate that significantly 
exceeds the average write frequency in programs (e.g., by a factor of two), the 
write buffer stalls will be small, and we can safely ignore them. If a system did not 
meet these criteria, it would not be well designed; instead, the designer should have 
used either a deeper write buffer or a write-back organization. 

Write-back schemes also have potential additional stalls arising from the need 
to write a cache block back to memory when the block is replaced. We will discuss 
this more in Section 7.5. 

In most write-through cache organizations, the read and write miss penalties 
are the same (the time to fetch the block from memory). If we assume that the 
write buffer stalls are negligible, we can combine the reads and writes by using a 
single miss rate and the miss penalty: 

$$\text{Memory-stall clock cycles} = \frac{\text{Memory accesses}}{\text{Program}} \times \text{Miss rate} \times \text{Miss penalty}$$

We can also factor this as 

$$\text{Memory-stall clock cycles} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Misses}}{\text{Instruction}} \times \text{Miss penalty}$$

Let's consider a simple example to help us understand the impact of cache perfor- 
mance on processor performance. 



EXAMPLE 

Calculating Cache Performance 

Assume an instruction cache miss rate for a program is 2% and a data cache 
miss rate is 4%. If a processor has a CPI of 2 without any memory stalls and 
the miss penalty is 100 cycles for all misses, determine how much faster a 
processor would run with a perfect cache that never missed. Use the instruction 
frequencies for SPECint2000 from Chapter 3, Figure 3.26, on page 228. 

ANSWER 

The number of memory miss cycles for instructions in terms of the instruction 
count (I) is 

Instruction miss cycles = I × 2% × 100 = 2.00 × I 

The frequency of all loads and stores in SPECint2000 is 36%. Therefore, we can 
find the number of memory miss cycles for data references: 

Data miss cycles = I × 36% × 4% × 100 = 1.44 × I 

The total number of memory-stall cycles is 2.00 I + 1.44 I = 3.44 I. This is 
more than 3 cycles of memory stall per instruction. Accordingly, the CPI with 
memory stalls is 2 + 3.44 = 5.44. Since there is no change in instruction count 
or clock rate, the ratio of the CPU execution times is 

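$$\frac{\text{CPU time with stalls}}{\text{CPU time with perfect cache}} = \frac{I \times \text{CPI}_{\text{stall}} \times \text{Clock cycle}}{I \times \text{CPI}_{\text{perfect}} \times \text{Clock cycle}} = \frac{\text{CPI}_{\text{stall}}}{\text{CPI}_{\text{perfect}}} = \frac{5.44}{2}$$

The performance with the perfect cache is better by 5.44 / 2 = 2.72. 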
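
The arithmetic in this example can be sketched in a few lines of Python. The function name `cpi_with_stalls` and its parameter names are illustrative assumptions rather than notation from the text; the inputs are the miss rates, load/store frequency, and miss penalty given in the example.

```python
def cpi_with_stalls(base_cpi, inst_miss_rate, data_miss_rate,
                    mem_ops_per_inst, miss_penalty):
    """Return the CPI including memory-stall cycles per instruction."""
    # Every instruction is fetched, so instruction misses scale with I.
    inst_stalls = inst_miss_rate * miss_penalty
    # Only loads and stores reference the data cache.
    data_stalls = mem_ops_per_inst * data_miss_rate * miss_penalty
    return base_cpi + inst_stalls + data_stalls


cpi_perfect = 2.0
cpi_stall = cpi_with_stalls(cpi_perfect, 0.02, 0.04, 0.36, 100)
print(f"CPI with stalls: {cpi_stall:.2f}")                  # 5.44
print(f"Perfect-cache speedup: {cpi_stall / cpi_perfect:.2f}")  # 2.72
```

Because instruction count and clock cycle time cancel in the ratio, the speedup depends only on the two CPI values.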



What happens if the processor is made faster, but the memory system is not? 
The amount of time spent on memory stalls will take up an increasing fraction of 
the execution time; Amdahl's law, which we examined in Chapter 4, reminds us of 
this fact. A few simple examples show how serious this problem can be. Suppose 
we speed up the computer in the previous example by reducing its CPI from 2 to 1 
without changing the clock rate, which might be done with an improved pipeline. 
The system with cache misses would then have a CPI of 1 + 3.44 = 4.44, and the 
system with the perfect cache would be 

$$\frac{4.44}{1} = 4.44 \text{ times faster}$$

The amount of execution time spent on memory stalls would have risen from 

$$\frac{3.44}{5.44} = 63\%$$

to 

$$\frac{3.44}{4.44} = 77\%$$
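
As a quick check of these percentages, the following Python sketch computes the fraction of execution time spent on memory stalls for both base CPIs. The helper name `stall_fraction` is an assumption made for illustration.

```python
def stall_fraction(base_cpi, stall_cycles_per_inst):
    """Fraction of execution time spent waiting on the memory system."""
    return stall_cycles_per_inst / (base_cpi + stall_cycles_per_inst)


# The stall cycles per instruction (3.44) do not change when the base CPI drops.
for base_cpi in (2.0, 1.0):
    percent = round(100 * stall_fraction(base_cpi, 3.44))
    print(f"base CPI {base_cpi}: {percent}% of time in memory stalls")
```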



Similarly, increasing clock rate without changing the memory system also 
increases the performance lost due to cache misses, as the next example shows. 



EXAMPLE 

Cache Performance with Increased Clock Rate 

Suppose we increase the performance of the computer in the previous example 
by doubling its clock rate. Since the main memory speed is unlikely to 
change, assume that the absolute time to handle a cache miss does not 
change. How much faster will the computer be with the faster clock, assuming 
the same miss rate as the previous example? 

ANSWER 

Measured in the faster clock cycles, the new miss penalty will be twice as 
many clock cycles, or 200 clock cycles. Hence: 

Total miss cycles per instruction = (2% × 200) + 36% × (4% × 200) = 6.88 

Thus, the faster computer with cache misses will have a CPI of 2 + 6.88 = 
8.88, compared to a CPI with cache misses of 5.44 for the slower computer. 

Using the formula for CPU time from the previous example, we can compute 
the relative performance as 

$$\frac{\text{Performance with fast clock}}{\text{Performance with slow clock}} = \frac{\text{Execution time with slow clock}}{\text{Execution time with fast clock}} = \frac{\text{IC} \times \text{CPI}_{\text{slow clock}} \times \text{Clock cycle}}{\text{IC} \times \text{CPI}_{\text{fast clock}} \times \frac{\text{Clock cycle}}{2}} = \frac{5.44}{8.88 \times \frac{1}{2}} = 1.23$$

Thus, the computer with the faster clock is about 1.2 times faster rather than 
2 times faster, which it would have been if we ignored cache misses. 
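
The clock-rate example can be reproduced with a short Python sketch. It assumes, as the example does, that the absolute miss time is fixed, so the miss penalty in cycles scales with the clock factor; the function name `relative_performance` and its parameters are hypothetical.

```python
def relative_performance(base_cpi, stalls_per_inst_slow, clock_factor):
    """Speedup from raising the clock rate by clock_factor with memory unchanged."""
    cpi_slow = base_cpi + stalls_per_inst_slow
    # A fixed miss time costs clock_factor times as many of the shorter cycles.
    cpi_fast = base_cpi + stalls_per_inst_slow * clock_factor
    # Execution time = IC x CPI x cycle time; the fast cycle is 1/clock_factor as long.
    return cpi_slow / (cpi_fast / clock_factor)


speedup = relative_performance(2.0, 3.44, 2)
print(f"Speedup from doubling the clock rate: {speedup:.2f}")  # 1.23
```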
As these examples illustrate, relative cache penalties increase as a processor 
becomes faster. Furthermore, if a processor improves both clock rate and CPI, it 
suffers a double hit: 

1. The lower the CPI, the more pronounced the impact of stall cycles. 

2. The main memory system is unlikely to improve as fast as processor cycle 
time, primarily because the performance of the underlying DRAM is not 
getting much faster. When calculating CPI, the cache miss penalty is mea- 
sured in processor clock cycles needed for a miss. Therefore, if the main 
memories of two processors have the same absolute access times, a higher 
processor clock rate leads to a larger miss penalty. 

Thus, the importance of cache performance for processors with low CPI and 
high clock rates is greater, and consequently the danger of neglecting cache 
behavior in assessing the performance of such processors is greater. As we will 
see in Section 7.6, the use of fast, pipelined processors in desktop PCs and 
workstations has led to the use of sophisticated cache systems even in computers 
selling for less than $1000. 

The previous examples and equations assume that the hit time is not a fac- 
tor in determining cache performance. Clearly, if the hit time increases, the 
total time to access a word from the memory system will increase, possibly 
causing an increase in the processor cycle time. Although we will see addi- 
tional examples of what can increase hit time shortly, one example is increas- 
ing the cache size. A larger cache could clearly have a longer access time, just 
as if your desk in the library was very large (say, 3 square meters), it would 
take longer to locate a book on the desk. With pipelines deeper than five 
stages, an increase in hit time likely adds another stage to the pipeline, since it 
may take multiple cycles for a cache hit. Although it is more complex to calcu- 
late the performance impact of a deeper pipeline, at some point the increase in 
hit time for a larger cache could dominate the improvement in hit rate, lead- 
ing to a decrease in processor performance. 

The next subsection discusses alternative cache organizations that decrease 
miss rate but may sometimes increase hit time; additional examples appear in Fal- 
lacies and Pitfalls (Section 7.7). 

Reducing Cache Misses by More Flexible Placement 
of Blocks 

So far, when we place a block in the cache, we have used a simple placement 
scheme: A block can go in exactly one place in the cache. As mentioned earlier, it 
is called direct mapped because there is a direct mapping from any block address in 
memory to a single location in the upper level of the hierarchy. There is actually a 
whole range of schemes for placing blocks. At one extreme is direct mapped, 
where a block can be placed in exactly one location. 

At the other extreme is a scheme where a block can be placed in any location in 
the cache. Such a scheme is called fully associative because a block in memory 
may be associated with any entry in the cache. To find a given block in a fully asso- 
ciative cache, all the entries in the cache must be searched because a block can be 
placed in any one. To make the search practical, it is done in parallel with a com- 
parator associated with each cache entry. These comparators significantly increase 
the hardware cost, effectively making fully associative placement practical only for 
caches with small numbers of blocks. 

The middle range of designs between direct mapped and fully associative is 
called set associative. In a set-associative cache, there are a fixed number of 
locations (at least two) where each block can be placed; a set-associative cache 
with n locations for a block is called an n-way set-associative cache. An n-way 
set-associative cache consists of a number of sets, each of which consists of n 
blocks. Each block in the memory maps to a unique set in the cache given by the 
index field, and a block can be placed in any element of that set. Thus, a set- 
associative placement combines direct-mapped placement and fully associative 
placement: a block is directly mapped into a set, and then all the blocks in the 
set are searched for a match. 

fully associative cache A cache structure in which a block can be placed in any 
location in the cache. 

set-associative cache A cache that has a fixed number of locations (at least two) 
where each block can be placed. 

Remember that in a direct-mapped cache, the position of a memory block is 
given by 

(Block number) modulo (Number of cache blocks) 

In a set-associative cache, the set containing a memory block is given by 

(Block number) modulo (Number of sets in the cache) 


Since the block may be placed in any element of the set, all the tags of all the ele- 
ments of the set must be searched. In a fully associative cache, the block can go 
anywhere and all tags of all the blocks in the cache must be searched. For example, 
Figure 7.13 shows where block 12 may be placed in a cache with eight blocks total, 
according to the block placement policy for direct-mapped, two-way set-associa- 
tive, and fully associative caches. 
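
The two modulo rules can be expressed as one small Python function, since direct mapped and fully associative are the one-way and m-way extremes of set associativity. This is only a sketch: the name `candidate_locations` is hypothetical, and numbering the blocks of each set contiguously is an assumption made for illustration.

```python
def candidate_locations(block_number, num_blocks, associativity):
    """Return (set index, cache blocks that may hold the memory block)."""
    num_sets = num_blocks // associativity
    set_index = block_number % num_sets        # (Block number) modulo (Number of sets)
    first = set_index * associativity          # assumed contiguous numbering within a set
    return set_index, list(range(first, first + associativity))


# Memory block 12 in a cache with eight blocks.
print(candidate_locations(12, 8, 1))  # direct mapped: set 4, block 4 only
print(candidate_locations(12, 8, 2))  # two-way: set 0, either of two blocks
print(candidate_locations(12, 8, 8))  # fully associative: any of the eight blocks
```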

We can think of every block placement strategy as a variation on set asso- 
ciativity. Figure 7.14 shows the possible associativity structures for an eight-block 
cache. A direct-mapped cache is simply a one-way set-associative cache: each 
[Figure 7.13 diagram: three panels labeled Direct mapped (block numbers 0 to 7), Set associative (set numbers 0 to 3), and Fully associative, each showing the Data and Tag arrays and the entries searched for memory block 12.] 

FIGURE 7.13 The location of a memory block whose address is 12 in a cache with 8 blocks varies for direct-mapped, 
set-associative, and fully associative placement. In direct-mapped placement, there is only one cache block where memory block 12 can be 
found, and that block is given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and memory block 12 must be in set 
(12 mod 4) = 0; the memory block could be in either element of the set. In a fully associative placement, the memory block for block address 12 can 
appear in any of the eight cache blocks. 



cache entry holds one block and each set has one element. A fully associative 
cache with m entries is simply an m-way set-associative cache; it has one set with 
m blocks, and an entry can reside in any block within that set. 

The advantage of increasing the degree of associativity is that it usually 
decreases the miss rate, as the next example shows. The main disadvantage, which 
we discuss in more detail shortly, is an increase in the hit time. 
[Figure 7.14 diagram: four panels showing the Tag and Data fields of an eight-block cache organized as one-way set associative (direct mapped, blocks 0 to 7), two-way set associative (sets 0 to 3), four-way set associative (sets 0 and 1), and eight-way set associative (fully associative).] 

FIGURE 7.14 An eight-block cache configured as direct mapped, two-way set associative, 
four-way set associative, and fully associative. The total size of the cache in blocks is equal 
to the number of sets times the associativity. Thus, for a fixed cache size, increasing the associativity 
decreases the number of sets, while increasing the number of elements per set. With eight blocks, an 
eight-way set-associative cache is the same as a fully associative cache. 



EXAMPLE 

Misses and Associativity in Caches 

Assume there are three small caches, each consisting of four one-word blocks. 
One cache is fully associative, a second is two-way set associative, and the 
third is direct mapped. Find the number of misses for each cache organization 
given the following sequence of block addresses: 0, 8, 0, 6, 8. 

ANSWER 

The direct-mapped case is easiest. First, let's determine to which cache block 
each block address maps: 



Block address    Cache block 
0                (0 modulo 4) = 0 
6                (6 modulo 4) = 2 
8                (8 modulo 4) = 0 



Now we can fill in the cache contents after each reference, using a blank entry 
to mean that the block is invalid. On each miss, the entry for the referenced 
block is the new entry added to the cache; every other entry is an old entry 
already in the cache: 



Address of memory   Hit       Contents of cache blocks after reference 
block accessed      or miss   0           1           2           3 
0                   miss      Memory[0] 
8                   miss      Memory[8] 
0                   miss      Memory[0] 
6                   miss      Memory[0]               Memory[6] 
8                   miss      Memory[8]               Memory[6] 


The direct-mapped cache generates five misses for the five accesses. 

The set-associative cache has two sets (with indices 0 and 1) with two 
elements per set. Let's first determine to which set each block address maps: 



Block address    Cache set 
0                (0 modulo 2) = 0 
6                (6 modulo 2) = 0 
8                (8 modulo 2) = 0 



Because we have a choice of which entry in a set to replace on a miss, we need 
a replacement rule. Set-associative caches usually replace the least recently 
used block within a set; that is, the block that was used furthest in the past is 
replaced. (We will discuss replacement rules in more detail shortly.) Using 
this replacement rule, the contents of the set-associative cache after each 
reference look like this: 

Address of memory   Hit       Contents of cache blocks after reference 
block accessed      or miss   Set 0       Set 0       Set 1       Set 1 
0                   miss      Memory[0] 
8                   miss      Memory[0]   Memory[8] 
0                   hit       Memory[0]   Memory[8] 
6                   miss      Memory[0]   Memory[6] 
8                   miss      Memory[8]   Memory[6] 


Notice that when block 6 is referenced, it replaces block 8, since block 8 has 
been less recently referenced than block 0. The two-way set-associative cache 
has four misses, one less than the direct-mapped cache. 

The fully associative cache has four cache blocks (in a single set); any memo- 
ry block can be stored in any cache block. The fully associative cache has the 
best performance, with only three misses: 



Address of memory   Hit       Contents of cache blocks after reference 
block accessed      or miss   Block 0     Block 1     Block 2     Block 3 
0                   miss      Memory[0] 
8                   miss      Memory[0]   Memory[8] 
0                   hit       Memory[0]   Memory[8] 
6                   miss      Memory[0]   Memory[8]   Memory[6] 
8                   hit       Memory[0]   Memory[8]   Memory[6] 


For this series of references, three misses is the best we can do because three 
unique block addresses are accessed. Notice that if we had eight blocks in the 
cache, there would be no replacements in the two-way set-associative cache 
(check this for yourself), and it would have the same number of misses as the 
fully associative cache. Similarly, if we had 16 blocks, all three caches would 
have the same number of misses. This change in miss rate shows us that cache 
size and associativity are not independent in determining cache performance. 
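
The miss counts in this example can be checked with a small simulator. The following Python sketch assumes LRU replacement within each set, as in the example; the function name `count_misses` and the list-based bookkeeping are illustrative choices, not a description of how hardware tracks recency.

```python
def count_misses(references, num_blocks, associativity):
    """Count misses for a sequence of block addresses under LRU replacement."""
    num_sets = num_blocks // associativity
    # Each set is a list ordered from least to most recently used.
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for block in references:
        entries = sets[block % num_sets]
        if block in entries:
            entries.remove(block)          # hit: re-appended below as most recent
        else:
            misses += 1
            if len(entries) == associativity:
                entries.pop(0)             # set is full: evict the least recently used
        entries.append(block)
    return misses


refs = [0, 8, 0, 6, 8]
print(count_misses(refs, 4, 1))  # direct mapped: 5 misses
print(count_misses(refs, 4, 2))  # two-way set associative: 4 misses
print(count_misses(refs, 4, 4))  # fully associative: 3 misses
```

Rerunning the function with 8 or 16 blocks reproduces the observation that the organizations converge on three misses as the cache grows.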



How much of a reduction in the miss rate is achieved by associativity? Figure 7.15 
shows the improvement for the SPEC2000 benchmarks for a 64 KB data cache with 
a 16-word block, and associativity ranging from direct mapped to eight-way. Going 
from one-way to two-way associativity decreases the miss rate by about 15%, but 
there is little further improvement in going to higher associativity. 

Associativity    Data miss rate 
1                10.3% 
2                8.6% 
4                8.3% 
8                8.1% 



FIGURE 7.15 The data cache miss rates for an organization like the Intrinsity FastMATH 
processor for SPEC2000 benchmarks with associativity varying from one-way to eight- 
way. These results for 10 SPEC2000 programs are from Hennessy and Patterson [2003]. 

Locating a Block in the Cache 

Now, let's consider the task of finding a block in a cache that is set associative. Just 
as in a direct-mapped cache, each block in a set-associative cache includes an 
address tag that gives the block address. The tag of every cache block within the 
appropriate set is checked to see if it matches the block address from the proces- 
sor. Figure 7.16 shows how the address is decomposed. The index value is used to 
select the set containing the address of interest, and the tags of all the blocks in the 
set must be searched. Because speed is of the essence, all the tags in the selected set 
are searched in parallel. As in a fully associative cache, a sequential search would 
make the hit time of a set-associative cache too slow. 

If the total cache size is kept the same, increasing the associativity increases 
the number of blocks per set, which is the number of simultaneous compares 
needed to perform the search in parallel: each increase by a factor of two in 
associativity doubles the number of blocks per set and halves the number of 
sets. Accordingly, each factor-of-two increase in associativity decreases the size 
of the index by 1 bit and increases the size of the tag by 1 bit. In a fully associa- 
tive cache, there is effectively only one set, and all the blocks must be checked in 
parallel. Thus, there is no index, and the entire address, excluding the block off- 
set, is compared against the tag of every block. In other words, we search the 
entire cache without any indexing. 
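
To make the decomposition concrete, here is a minimal Python sketch that splits an address into tag, index, and block offset. It assumes the block size in bytes and the number of sets are both powers of two; the function name `split_address` and the sample address are hypothetical.

```python
def split_address(address, block_bytes, num_sets):
    """Split an address into (tag, index, block offset).

    Assumes block_bytes and num_sets are powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1   # log2(block_bytes)
    index_bits = num_sets.bit_length() - 1       # log2(num_sets)
    offset = address & (block_bytes - 1)
    index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# 16-byte blocks and 256 sets: 4 offset bits and 8 index bits.
print([hex(field) for field in split_address(0x12345678, 16, 256)])
# A fully associative cache has one set, so there is no index and the tag widens.
print([hex(field) for field in split_address(0x12345678, 16, 1)])
```

Halving the number of sets removes one index bit and adds it to the tag, which is the effect of each factor-of-two increase in associativity described above.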

In a direct-mapped cache, such as in Figure 7.7 on page 478, only a single com- 
parator is needed, because the entry can be in only one block, and we access the 
cache simply by indexing. Figure 7.17 shows that in a four-way set-associative 
cache, four comparators are needed, together with a 4-to-1 multiplexor to choose 



| Tag | Index | Block offset | 



FIGURE 7.16 The three portions of an address in a set-associative or direct-mapped 
cache. The index is used to select the set, then the tag is used to choose the block by comparison with the 
blocks in the selected set. The block offset is the address of the desired data within the block. 
[Figure 7.17 diagram: a 32-bit address (bits 31 to 0) supplies a 22-bit tag and an 8-bit index; the index selects one set of four entries, each with V, Tag, and Data fields, and a 4-to-1 multiplexor selects the data output.] 

FIGURE 7.17 The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor. 

The comparators determine which element of the selected set (if any) matches the tag. The output of the comparators is used to select the data from 
one of the four blocks of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output enable signals on the 
data portions of the cache RAMs can be used to select the entry in the set that drives the output. The Output enable signal comes from the comparators, 
causing the element that matches to drive the data outputs. This organization eliminates the need for the multiplexor. 



among the four potential members of the selected set. The cache access consists of 
indexing the appropriate set and then searching the tags of the set. The costs of an 
associative cache are the extra comparators and any delay imposed by having to 
do the compare and select from among the elements of the set. 

The choice among direct-mapped, set-associative, or fully associative mapping 
in any memory hierarchy will depend on the cost of a miss versus the cost of 
implementing associativity, both in time and in extra hardware. 
Size of Tags versus Set Associativity 

Increasing associativity requires more comparators, and more tag bits per cache 
block. Assuming a cache of 4K blocks, a four-word block size, and a 32-bit ad- 
dress, find the total number of sets and the total number of tag bits for caches that 
are direct mapped, two-way and four-way set associative, and fully associative. 



Since there are 16 (= 2^4) bytes per block, a 32-bit address yields 32 - 4 = 28 
bits to be used for index and tag. The direct-mapped cache has the same 
number of sets as blocks, and hence 12 bits of index, since log2(4K) = 12; 
hence, the total number of tag bits is (28 - 12) × 4K = 16 × 4K = 64 Kbits. 

Each degree of associativity decreases the number of sets by a factor of two and 
thus decreases the number of bits used to index the cache by one and increases the 
number of bits in the tag by one. Thus, for a two-way set-associative cache, there 
are 2K sets, and the total number of tag bits is (28 - 11) × 2 × 2K = 34 × 2K = 68 Kbits. 
For a four-way set-associative cache, the total number of sets is 1K, and the total 
number of tag bits is (28 - 10) × 4 × 1K = 72 × 1K = 72 Kbits. 

For a fully associative cache, there is only one set with 4K blocks, and the tag 
is 28 bits, leading to a total of 28 × 4K × 1 = 112K tag bits. 
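
The tag-storage totals above follow a single pattern, which the Python sketch below makes explicit. It assumes power-of-two sizes throughout; the function name `tag_storage` is a hypothetical helper, and totals are printed in Kbits (multiples of 1024 bits).

```python
def tag_storage(num_blocks, block_bytes, address_bits, associativity):
    """Return (number of sets, tag bits per block, total tag bits)."""
    num_sets = num_blocks // associativity
    offset_bits = block_bytes.bit_length() - 1   # log2(block size in bytes)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = address_bits - offset_bits - index_bits
    return num_sets, tag_bits, tag_bits * num_blocks


# 4K blocks of four words (16 bytes) with a 32-bit address.
for ways in (1, 2, 4, 4096):
    sets, tag_bits, total = tag_storage(4096, 16, 32, ways)
    print(f"{ways}-way: {sets} sets, {tag_bits}-bit tag, {total // 1024} Kbits of tags")
```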



least recently used (LRU) A 
replacement scheme in which 
the block replaced is the one 
that has been unused for the 
longest time. 



Choosing Which Block to Replace 

When a miss occurs in a direct-mapped cache, the requested block can go in 
exactly one position, and the block occupying that position must be replaced. In 
an associative cache, we have a choice of where to place the requested block, and 
hence a choice of which block to replace. In a fully associative cache, all blocks are 
candidates for replacement. In a set-associative cache, we must choose among the 
blocks in the selected set. 

The most commonly used scheme is least recently used (LRU), which we used 
in the previous example. In an LRU scheme, the block replaced is the one that has 
been unused for the longest time. LRU replacement is implemented by keeping 
track of when each element in a set was used relative to the other elements in the 
set. For a two-way set-associative cache, tracking when the two elements were 
used can be implemented by keeping a single bit in each set and setting the bit to 
indicate an element whenever that element is referenced. As associativity 
increases, implementing LRU gets harder; in Section 7.5, we will see an alternative 
scheme for replacement. 
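
The single-bit scheme for a two-way set can be sketched as follows. This is a minimal illustration in Python, not a hardware description: the class name `TwoWaySet` is hypothetical, and filling an invalid way before evicting anything is an assumption the text does not spell out.

```python
class TwoWaySet:
    """One set of a two-way set-associative cache with a single LRU bit."""

    def __init__(self):
        self.blocks = [None, None]   # None marks an invalid entry
        self.mru = 0                 # the one tracking bit: most recently used way

    def access(self, block):
        """Reference a block address; return True on a hit, False on a miss."""
        if block in self.blocks:
            way, hit = self.blocks.index(block), True
        else:
            hit = False
            if None in self.blocks:
                way = self.blocks.index(None)   # assumed: fill an invalid way first
            else:
                way = 1 - self.mru              # replace the way the bit does not indicate
            self.blocks[way] = block
        self.mru = way                          # set the bit to the referenced element
        return hit


cache_set = TwoWaySet()
print([cache_set.access(b) for b in (0, 8, 0, 6, 8)])  # one hit, on the third reference
print(cache_set.blocks)                                # final contents: [8, 6]
```

With only two ways, the way not named by the bit is by definition the least recently used, which is why one bit per set suffices.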
Reducing the Miss Penalty Using Multilevel Caches 

All modern computers make use of caches. In most cases, these caches are imple- 
mented on the same die as the microprocessor that forms the processor. To fur- 
ther close the gap between the fast clock rates of modern processors and the 
relatively long time required to access DRAMs, many microprocessors support an 
additional level of caching. This second-level cache, which can be on the same 
chip or off-chip in a separate set of SRAMs, is accessed whenever a miss occurs in 
the primary cache. If the second-level cache contains the desired data, the miss 
penalty for the first-level cache will be the access time of the second-level cache, 
which will be much less than the access time of main memory. If neither the pri- 
mary nor secondary cache contains the data, a main memory access is required, 
and a larger miss penalty is incurred. 

How significant is the performance improvement from the use of a secondary 
cache? The next example shows us. 



EXAMPLE 

Performance of Multilevel Caches 

Suppose we have a processor with a base CPI of 1.0, assuming all references 
hit in the primary cache, and a clock rate of 5 GHz. Assume a main memory 
access time of 100 ns, including all the miss handling. Suppose the miss rate 
per instruction at the primary cache is 2%. How much faster will the processor 
be if we add a secondary cache that has a 5 ns access time for either a hit 
or a miss and is large enough to reduce the miss rate to main memory to 
0.5%? 

ANSWER 

The miss penalty to main memory is 

$$\frac{100\ \text{ns}}{0.2\ \frac{\text{ns}}{\text{clock cycle}}} = 500 \text{ clock cycles}$$
The effective CPI with one level of caching is given by 

Total CPI = Base CPI + Memory-stall cycles per instruction 

For the processor with one level of caching, 

Total CPI = 1.0 + Memory-stall cycles per instruction = 1.0 + 2% × 500 = 11.0 

With two levels of cache, a miss in the primary (or first-level) cache can be 
satisfied either by the secondary cache or by main memory. The miss penalty 
for an access to the second-level cache is 



$$\frac{5\ \text{ns}}{0.2\ \frac{\text{ns}}{\text{clock cycle}}} = 25 \text{ clock cycles}$$



If the miss is satisfied in the secondary cache, then this is the entire miss penal- 
ty. If the miss needs to go to main memory, then the total miss penalty is the 
sum of the secondary cache access time and the main memory access time. 

Thus, for a two-level cache, total CPI is the sum of the stall cycles from both 
levels of cache and the base CPI: 

Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction 
          = 1 + 2% × 25 + 0.5% × 500 = 1 + 0.5 + 2.5 = 4.0 

Thus, the processor with the secondary cache is faster by 



$$\frac{11.0}{4.0} = 2.8$$



Alternatively, we could have computed the stall cycles by summing the stall cycles 
of those references that hit in the secondary cache ((2% - 0.5%) x 25 = 0.4) and 
those references that go to main memory, which must include the cost to access 
the secondary cache as well as the main memory access time (0.5% x (25 + 500) = 
2.6). The sum, 1.0 + 0.4 + 2.6, is again 4.0. 
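
The two-level calculation can be sketched in Python as follows. The function names and parameters are assumptions introduced for illustration; the inputs (5 GHz clock, 100 ns memory access, 5 ns secondary cache access, 2% and 0.5% miss rates) come from the example.

```python
def to_cycles(time_ns, clock_ghz):
    """Convert nanoseconds to clock cycles; a GHz is one cycle per nanosecond."""
    return time_ns * clock_ghz


def cpi_one_level(base_cpi, primary_miss_rate, memory_penalty):
    """Every primary miss pays the full main memory penalty."""
    return base_cpi + primary_miss_rate * memory_penalty


def cpi_two_level(base_cpi, primary_miss_rate, secondary_penalty,
                  global_miss_rate, memory_penalty):
    """Every primary miss pays the secondary access; global misses also pay memory."""
    return (base_cpi
            + primary_miss_rate * secondary_penalty
            + global_miss_rate * memory_penalty)


memory_penalty = to_cycles(100, 5)     # 500 clock cycles
secondary_penalty = to_cycles(5, 5)    # 25 clock cycles
one_level = cpi_one_level(1.0, 0.02, memory_penalty)
two_level = cpi_two_level(1.0, 0.02, secondary_penalty, 0.005, memory_penalty)
print(f"One level: CPI {one_level:.2f}")
print(f"Two levels: CPI {two_level:.2f}")
print(f"Speedup: {one_level / two_level:.2f}")   # 2.75, which the text rounds to 2.8
```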
The design considerations for a primary and secondary cache are significantly 
different because the presence of the other cache changes the best choice versus a 
single-level cache. In particular, a two-level cache structure allows the primary 
cache to focus on minimizing hit time to yield a shorter clock cycle, while allow- 
ing the secondary cache to focus on miss rate to reduce the penalty of long mem- 
ory access times. 

The interaction of the two caches permits such a focus. The miss penalty of the 
primary cache is significantly reduced by the presence of the secondary cache, 
allowing the primary to be smaller and have a higher miss rate. For the secondary 
cache, access time becomes less important with the presence of the primary cache, 
since the access time of the secondary cache affects the miss penalty of the pri- 
mary cache, rather than directly affecting the primary cache hit time or the pro- 
cessor cycle time. 

The effect of these changes on the two caches can be seen by comparing each 
cache to the optimal design for a single level of cache. In comparison to a single- 
level cache, the primary cache of a multilevel cache is often smaller. Furthermore, 
the primary cache often uses a smaller block size, to go with the smaller cache size 
and reduced miss penalty. In comparison, the secondary cache will often be larger 
than in a single-level cache, since the access time of the secondary cache is less 
critical. With a larger total size, the secondary cache often will use a larger block 
size than appropriate with a single-level cache. 



multilevel cache A memory 
hierarchy with multiple levels of 
caches, rather than just a cache 
and main memory. 



In Chapter 2, we saw that Quicksort had an algorithmic advantage over Bubble 
Sort that could not be overcome by language or compiler optimization. Figure 
7.18(a) shows instructions executed per item sorted for Radix Sort versus Quicksort. 
Indeed, for large arrays, Radix Sort has an algorithmic advantage over Quicksort 
in terms of number of operations. Figure 7.18(b) shows time per key instead 
of instructions executed. We see that the lines start on the same trajectory as Fig- 
ure 7.18(a), but then the Radix Sort line diverges as the data to sort increases. 
What is going on? Figure 7.18(c) answers by looking at the cache misses per item 
sorted: Quicksort consistently has many fewer misses per item to be sorted. 

Alas, standard algorithmic analysis ignores the impact of the memory hierar- 
chy. As faster clock rates and Moore's law allow architects to squeeze all of the per- 
formance out of a stream of instructions, using the memory hierarchy well is 
critical to high performance. As we said in the introduction, understanding the 
behavior of the memory hierarchy is critical to understanding the performance of 
programs on today's computers. 



Understanding Program Performance 
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FIGURE 7.18 Comparing Quicksort and Radix Sort by (a) instructions executed per item 
sorted, (b) time per item sorted, and (c) cache misses per item sorted. This data is from a 
paper by LaMarca and Ladner [1996]. Although the numbers would change for newer computers, the 
idea still holds. Due to such results, new versions of Radix Sort have been invented that take memory hierar- 
chy into account, to regain its algorithmic advantages (see Section 7.7). The basic idea of cache optimiza- 
tions is to use all the data in a block repeatedly before it is replaced on a miss. 
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Elaboration: Multilevel caches create several complications. First, there are now 
several different types of misses and corresponding miss rates. In the example on 
page 499, we saw the primary cache miss rate and the global miss rate — the fraction 
of references that missed in all cache levels. There is also a miss rate for the second- 
ary cache, which is the ratio of all misses in the secondary cache divided by the num- 
ber of accesses. This miss rate is called the local miss rate of the secondary cache. 
Because the primary cache filters accesses, especially those with good spatial and 
temporal locality, the local miss rate of the secondary cache is much higher than the 
global miss rate. For the example on page 499, we can compute the local miss rate of 
the secondary cache as: 0.5%/2% = 25%! Luckily, the global miss rate dictates how 
often we must access the main memory. 
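
The relationship between the local and global miss rates can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the 2% and 0.5% figures are the ones quoted above from the example on page 499.

```python
def local_miss_rate(global_miss_rate: float, primary_miss_rate: float) -> float:
    """Local miss rate of the secondary cache.

    Only references that miss in the primary cache reach the secondary
    cache, so its local miss rate is (misses in both levels) divided by
    (accesses that reach the secondary cache).
    """
    return global_miss_rate / primary_miss_rate


# Figures quoted in the text: 2% primary miss rate, 0.5% global miss rate.
primary = 0.02
global_rate = 0.005
print(f"local miss rate of secondary cache: {local_miss_rate(global_rate, primary):.0%}")
```

The secondary cache looks poor when judged by its local miss rate, yet it is the global miss rate that determines how often main memory is accessed.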

Additional complications arise because the caches may have different block sizes to 
match the larger or smaller total size. Likewise, the associativity of the cache may 
change. On-chip caches are often built with associativity of four or higher, while off-chip 
caches rarely have associativity of greater than two. On-chip L1 caches tend to have 
lower associativity than on-chip L2 caches since fast hit time is more important for L1 
caches. These changes in block size and associativity introduce complications in the 
modeling of the caches, which typically mean that all levels need to be simulated 
together to understand the behavior. 



global miss rate The fraction 
of references that miss in all lev- 
els of a multilevel cache. 

local miss rate The fraction of 
references to one level of a cache 
that miss; used in multilevel 
hierarchies. 



Elaboration: With out-of-order processors, performance is more complex, since they 
execute instructions during the miss penalty. Instead of instruction miss rate and data 
miss rates, we use misses per instruction, and this formula: 



$$\frac{\text{Memory stall cycles}}{\text{Instruction}} = \frac{\text{Misses}}{\text{Instruction}} \times (\text{Total miss latency} - \text{Overlapped miss latency})$$



There is no general way to calculate overlapped miss latency, so evaluations of 
memory hierarchies for out-of-order processors inevitably require simulation of the pro- 
cessor and memory hierarchy. Only by seeing the execution of the processor during 
each miss can we see if the processor stalls waiting for data or simply finds other work 
to do. A guideline is that the processor often hides the miss penalty for an L1 cache 
miss that hits in the L2 cache, but it rarely hides a miss to the L2 cache. 
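
As a rough illustration of the formula for out-of-order processors, the sketch below evaluates memory stall cycles per instruction. The function name and all numeric values are assumptions chosen for the example; in practice the overlapped latency has to come from simulation, as the text notes.

```python
def stall_cycles_per_instruction(misses_per_instruction: float,
                                 total_miss_latency: float,
                                 overlapped_miss_latency: float) -> float:
    """Memory stall cycles per instruction for an out-of-order processor.

    Only the part of the miss latency that is NOT overlapped with useful
    work shows up as stall time.
    """
    if overlapped_miss_latency > total_miss_latency:
        raise ValueError("overlapped latency cannot exceed total latency")
    return misses_per_instruction * (total_miss_latency - overlapped_miss_latency)


# Hypothetical numbers: 0.02 misses per instruction, a 100-cycle miss,
# of which 30 cycles are hidden by executing other instructions.
stalls = stall_cycles_per_instruction(0.02, 100, 30)
print(f"stall cycles per instruction: {stalls:.2f}")
```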



Elaboration: The performance challenge for algorithms is that the memory hierarchy 
varies between different implementations of the same architecture in cache size, associativity, 
block size, and number of caches. To cope with such variability, some recent 
numerical libraries parameterize their algorithms and then search the parameter space 
at runtime to find the best combination for a particular computer. 

Check Yourself 



Which of the following is generally true about a design with multiple levels of 
caches? 

1. First-level caches are more concerned about hit time, and second-level 
caches are more concerned about miss rate. 

2. First-level caches are more concerned about miss rate, and second-level 
caches are more concerned about hit time. 



Summary 

In this section, we focused on three topics: cache performance, using associativity 
to reduce miss rates, and the use of multilevel cache hierarchies to reduce miss 
penalties. 

Since the total number of cycles spent on a program is the sum of the processor 
cycles and the memory-stall cycles, the memory system can have a significant effect 
on program execution time. In fact, as processors get faster (by lowering CPI or by 
increasing the clock rate or both), the relative effect of the memory-stall cycles 
increases, making good memory systems critical to achieving high performance. 
The number of memory-stall cycles depends on both the miss rate and the miss 
penalty. The challenge, as we will see in Section 7.5, is to reduce one of these factors 
without significantly affecting other critical factors in the memory hierarchy. 

To reduce the miss rate, we examined the use of associative placement schemes. 
Such schemes can reduce the miss rate of a cache by allowing more flexible place- 
ment of blocks within the cache. Fully associative schemes allow blocks to be 
placed anywhere, but also require that every block in the cache be searched to sat- 
isfy a request. This search is usually implemented by having a comparator per 
cache block and searching the tags in parallel. The cost of the comparators makes 
large fully associative caches impractical. Set-associative caches are a practical alter- 
native, since we need only search among the elements of a unique set that is cho- 
sen by indexing. Set-associative caches have higher miss rates but are faster to 
access. The amount of associativity that yields the best performance depends on 
both the technology and the details of the implementation. 

Finally, we looked at multilevel caches as a technique to reduce the miss pen- 
alty by allowing a larger secondary cache to handle misses to the primary cache. 
Second-level caches have become commonplace as designers find that limited 
silicon and the goals of high clock rates prevent primary caches from becoming 
large. The secondary cache, which is often 10 or more times larger than the pri- 
mary cache, handles many accesses that miss in the primary cache. In such 
cases, the miss penalty is that of the access time to the secondary cache (typically 
< 10 processor cycles) versus the access time to memory (typically > 100 processor 
cycles). As with associativity, the design trade-offs between the size of the 
secondary cache and its access time depend on a number of aspects of the 
implementation. 



7.4 Virtual Memory 



In the previous section, we saw how caches provided fast access to recently used 
portions of a program's code and data. Similarly, the main memory can act as a 
"cache" for the secondary storage, usually implemented with magnetic disks. This 
technique is called virtual memory. Historically, there were two major motiva- 
tions for virtual memory: to allow efficient and safe sharing of memory among 
multiple programs, and to remove the programming burdens of a small, limited 
amount of main memory. Four decades after its invention, it's the former reason 
that reigns today. 

Consider a collection of programs running at once on a computer. The total 
memory required by all the programs may be much larger than the amount of 
main memory available on the computer, but only a fraction of this memory is 
actively being used at any point in time. Main memory need contain only the 
active portions of the many programs, just as a cache contains only the active por- 
tion of one program. Thus, the principle of locality enables virtual memory as 
well as caches, and virtual memory allows us to efficiently share the processor as 
well as the main memory. Of course, to allow multiple programs to share the same 
memory, we must be able to protect the programs from each other, ensuring that 
a program can only read and write the portions of main memory that have been 
assigned to it. 

We cannot know which programs will share the memory with other pro- 
grams when we compile them. In fact, the programs sharing the memory 
change dynamically while the programs are running. Because of this dynamic 
interaction, we would like to compile each program into its own address space, a 
separate range of memory locations accessible only to this program. Virtual 
memory implements the translation of a program's address space to physical 
addresses. This translation process enforces protection of a program's address 
space from other programs. 

The second motivation for virtual memory is to allow a single user program to 
exceed the size of primary memory. Formerly, if a program became too large for 
memory, it was up to the programmer to make it fit. Programmers divided programs 
into pieces and then identified the pieces that were mutually exclusive. 
These overlays were loaded or unloaded under user program control during execution, 
with the programmer ensuring that the program never tried to access an 
overlay that was not loaded and that the overlays loaded never exceeded the total 
size of the memory. Overlays were traditionally organized as modules, each containing 
both code and data. Calls between procedures in different modules would 
lead to overlaying of one module with another. 



. . . a system has been 
devised to make the core 
drum combination appear to 
the programmer as a single 
level store, the requisite 
transfers taking place 
automatically. 

Kilburn et al., "One-level storage 
system," 1962 

virtual memory A technique 
that uses main memory as a 
"cache" for secondary storage. 

physical address An address 
in main memory. 

protection A set of mechanisms 
for ensuring that multiple 
processes sharing the processor, 
memory, or I/O devices cannot 
interfere, intentionally or 
unintentionally, with one another by 
reading or writing each other's 
data. These mechanisms also 
isolate the operating system 
from a user process. 

page fault An event that occurs 
when an accessed page is not 
present in main memory. 

virtual address An address 
that corresponds to a location in 
virtual space and is translated by 
address mapping to a physical 
address when memory is 
accessed. 

address translation Also 
called address mapping. The 
process by which a virtual 
address is mapped to an address 
used to access memory. 

As you can well imagine, this responsibility was a substantial burden on pro- 
grammers. Virtual memory, which was invented to relieve programmers of this 
difficulty, automatically manages the two levels of the memory hierarchy repre- 
sented by main memory (sometimes called physical memory to distinguish it from 
virtual memory) and secondary storage. 

Although the concepts at work in virtual memory and in caches are the 
same, their differing historical roots have led to the use of different termin- 
ology. A virtual memory block is called a page, and a virtual memory miss is 
called a page fault. With virtual memory, the processor produces a virtual 
address, which is translated by a combination of hardware and software to a 
physical address, which in turn can be used to access main memory. 
Figure 7.19 shows the virtually addressed memory with pages mapped to main 
memory. This process is called address mapping or address translation. Today, 
the two memory hierarchy levels controlled by virtual memory are DRAMs 
and magnetic disks (see Chapter 1, pages 5, 13 and 23). If we return to our 
library analogy, we can think of a virtual address as the title of a book and a 
physical address as the location of that book in the library, such as might be 
given by the Library of Congress call number. 

[Figure 7.19 diagram: virtual addresses pass through address translation to either physical addresses or disk addresses.] 

FIGURE 7.19 In virtual memory, blocks of memory (called pages) are mapped from one 
set of addresses (called virtual addresses) to another set (called physical addresses). 
The processor generates virtual addresses while the memory is accessed using physical addresses. Both the 
virtual memory and the physical memory are broken into pages, so that a virtual page is really mapped to a 
physical page. Of course, it is also possible for a virtual page to be absent from main memory and not be 
mapped to a physical address, residing instead on disk. Physical pages can be shared by having two virtual 
addresses point to the same physical address. This capability is used to allow two different programs to share 
data or code. 

Virtual memory also simplifies loading the program for execution by provid- 
ing relocation. Relocation maps the virtual addresses used by a program to dif- 
ferent physical addresses before the addresses are used to access memory. This 
relocation allows us to load the program anywhere in main memory. Further- 
more, all virtual memory systems in use today relocate the program as a set of 
fixed-size blocks (pages), thereby eliminating the need to find a contiguous 
block of memory to allocate to a program; instead, the operating system need 
only find a sufficient number of pages in main memory. Formerly, relocation 
problems required special hardware and special support in the operating sys- 
tem; today, virtual memory also provides this function. 

In virtual memory, the address is broken into a virtual page number and a page 
offset. Figure 7.20 shows the translation of the virtual page number to a physical 
page number. The physical page number constitutes the upper portion of the 
physical address, while the page offset, which is not changed, constitutes the lower 
portion. The number of bits in the page-offset field determines the page size. The 
number of pages addressable with the virtual address need not match the number 
of pages addressable with the physical address. Having a larger number of virtual 
pages than physical pages is the basis for the illusion of an essentially unbounded 
amount of virtual memory. 

FIGURE 7.20 Mapping from a virtual to a physical address. The page size is $2^{12}$ = 4 KB. The 
number of physical pages allowed in memory is $2^{18}$, since the physical page number has 18 bits in it. Thus, 
main memory can have at most 1 GB, while the virtual address space is 4 GB. 
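
The split of a virtual address into a virtual page number and a page offset can be sketched with shifts and masks. The code below is a minimal illustration assuming 4 KB pages (a 12-bit offset); the function names and the sample addresses are hypothetical.

```python
PAGE_OFFSET_BITS = 12                 # assumption: 4 KB pages
PAGE_SIZE = 1 << PAGE_OFFSET_BITS


def split_virtual_address(virtual_address: int) -> tuple[int, int]:
    """Return (virtual page number, page offset)."""
    virtual_page_number = virtual_address >> PAGE_OFFSET_BITS
    page_offset = virtual_address & (PAGE_SIZE - 1)
    return virtual_page_number, page_offset


def make_physical_address(physical_page_number: int, page_offset: int) -> int:
    """The physical page number forms the upper bits; the offset is unchanged."""
    return (physical_page_number << PAGE_OFFSET_BITS) | page_offset


vpn, offset = split_virtual_address(0x00403ABC)
print(hex(vpn), hex(offset))          # 0x403 0xabc
# Suppose, hypothetically, that virtual page 0x403 maps to physical page 0x2A.
print(hex(make_physical_address(0x2A, offset)))   # 0x2aabc
```

Because translation replaces only the upper bits, changing the number of offset bits changes the page size without affecting the rest of the scheme.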

Many design choices in virtual memory systems are motivated by the high cost 
of a miss, which in virtual memory is traditionally called a page fault. A page fault 
will take millions of clock cycles to process. (The table on page 469 shows that 
main memory is about 100,000 times faster than disk.) This enormous miss penalty, 
dominated by the time to get the first word for typical page sizes, leads to several 
key decisions in designing virtual memory systems: 

■ Pages should be large enough to try to amortize the high access time. Sizes 
from 4 KB to 16 KB are typical today. New desktop and server systems are 
being developed to support 32 KB and 64 KB pages, but new embedded sys- 
tems are going in the other direction, to 1 KB pages. 

■ Organizations that reduce the page fault rate are attractive. The primary tech- 
nique used here is to allow fully associative placement of pages in memory. 

■ Page faults can be handled in software because the overhead will be small 
compared to the disk access time. In addition, software can afford to use 
clever algorithms for choosing how to place pages because even small reduc- 
tions in the miss rate will pay for the cost of such algorithms. 

■ Write-through will not work for virtual memory, since writes take too long. 
Instead, virtual memory systems use write-back. 

The next few subsections address these factors in virtual memory design. 



segmentation A variable-size 
address mapping scheme in 
which an address consists of two 
parts: a segment number, which 
is mapped to a physical address, 
and a segment offset. 



Elaboration: Although we normally think of virtual addresses as much larger than 
physical addresses, the opposite can occur when the processor address size is small rel- 
ative to the state of the memory technology. No single program can benefit, but a collec- 
tion of programs running at the same time can benefit from not having to be swapped to 
memory or by running on parallel processors. Given that Moore's law applies to DRAM, 
32-bit processors are already problematic for servers and soon for desktops. 

Elaboration: The discussion of virtual memory in this book focuses on paging, which 
uses fixed-size blocks. There is also a variable-size block scheme called segmentation. 
In segmentation, an address consists of two parts: a segment number and a segment 
offset. The segment number is mapped to a physical address, and the offset is added 
to find the actual physical address. Because the segment can vary in size, a bounds 
check is also needed to make sure that the offset is within the segment. The major use 
of segmentation is to support more powerful methods of protection and sharing in an 
address space. Most operating system textbooks contain extensive discussions of seg- 
mentation compared to paging and of the use of segmentation to logically share the 
address space. The major disadvantage of segmentation is that it splits the address 
space into logically separate pieces that must be manipulated as a two-part 
address: the segment number and the offset. Paging, in contrast, makes the boundary 
between page number and offset invisible to programmers and compilers. 

Segments have also been used as a method to extend the address space without 
changing the word size of the computer. Such attempts have been unsuccessful 
because of the awkwardness and performance penalties inherent in a two-part address 
of which programmers and compilers must be aware. 

Many architectures divide the address space into large fixed-size blocks that sim- 
plify protection between the operating system and user programs and increase the effi- 
ciency of implementing paging. Although these divisions are often called "segments," 
this mechanism is much simpler than variable block size segmentation and is not visi- 
ble to user programs; we discuss it in more detail shortly. 
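
For readers who want to see the variable-size segmentation scheme from this Elaboration in concrete form, here is a minimal sketch. The segment table layout (a base address and a size per segment), the exception name, and the sample values are all assumptions made for illustration.

```python
class SegmentBoundsError(Exception):
    """Raised when an offset falls outside its segment."""


def translate_segmented(segment_table: dict[int, tuple[int, int]],
                        segment_number: int, offset: int) -> int:
    """Map a (segment number, offset) pair to a physical address.

    Each table entry is (base physical address, segment size in bytes).
    Because segments vary in size, the offset must be bounds-checked.
    """
    base, size = segment_table[segment_number]
    if not 0 <= offset < size:
        raise SegmentBoundsError(f"offset {offset:#x} outside segment {segment_number}")
    return base + offset


# Hypothetical table: segment 0 is 16 KB at 0x10000, segment 1 is 2 KB at 0x80000.
table = {0: (0x10000, 0x4000), 1: (0x80000, 0x800)}
print(hex(translate_segmented(table, 0, 0x1234)))   # 0x11234
```

Note that the physical address is formed by addition rather than by concatenating bit fields, which is one reason the two-part address is visible to programmers and compilers.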



Placing a Page and Finding It Again 

Because of the incredibly high penalty for a page fault, designers reduce page fault 
frequency by optimizing page placement. If we allow a virtual page to be mapped 
to any physical page, the operating system can then choose to replace any page it 
wants when a page fault occurs. For example, the operating system can use a 
sophisticated algorithm and complex data structures, which track page usage, to 
try to choose a page that will not be needed for a long time. The ability to use a 
clever and flexible replacement scheme reduces the page fault rate and simplifies 
the use of fully associative placement of pages. 

As mentioned in Section 7.3, the difficulty in using fully associative place- 
ment is in locating an entry, since it can be anywhere in the upper level of the 
hierarchy. A full search is impractical. In virtual memory systems, we locate 
pages by using a table that indexes the memory; this structure is called a page 
table and resides in memory. A page table is indexed with the page number 
from the virtual address to discover the corresponding physical page number. 
Each program has its own page table, which maps the virtual address space of 
that program to main memory. In our library analogy, the page table corre- 
sponds to a mapping between book titles and library locations. Just as the card 
catalog may contain entries for books in another library on campus rather than 
the local branch library, we will see that the page table may contain entries for 
pages not present in memory. To indicate the location of the page table in mem- 
ory, the hardware includes a register that points to the start of the page table; we 
call this the page table register. Assume for now that the page table is in a fixed 
and contiguous area of memory. 



page table The table contain- 
ing the virtual to physical 
address translations in a virtual 
memory system. The table, 
which is stored in memory, is 
typically indexed by the virtual 
page number; each entry in the 
table contains the physical page 
number for that virtual page if 
the page is currently in memory. 

Hardware/Software Interface 



The page table, together with the program counter and the registers, specifies the 
state of a program. If we want to allow another program to use the processor, we 
must save this state. Later, after restoring this state, the program can continue 
execution. We often refer to this state as a process. The process is considered active 
when it is in possession of the processor; otherwise, it is considered inactive. The 
operating system can make a process active by loading the process's state, includ- 
ing the program counter, which will initiate execution at the value of the saved 
program counter. 

The process's address space, and hence all the data it can access in memory, is 
defined by its page table, which resides in memory. Rather than save the entire 
page table, the operating system simply loads the page table register to point to 
the page table of the process it wants to make active. Each process has its own page 
table, since different processes use the same virtual addresses. The operating sys- 
tem is responsible for allocating the physical memory and updating the page 
tables, so that the virtual address spaces of different processes do not collide. As 
we will see shortly, the use of separate page tables also provides protection of one 
process from another. 



Figure 7.21 uses the page table register, the virtual address, and the indicated 
page table to show how the hardware can form a physical address. A valid bit is 
used in each page table entry, just as we did in a cache. If the bit is off, the page is 
not present in main memory and a page fault occurs. If the bit is on, the page is 
in memory and the entry contains the physical page number. 

Because the page table contains a mapping for every possible virtual page, no 
tags are required. In cache terminology, the index that is used to access the page 
table consists of the full block address, which is the virtual page number. 
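
The lookup just described can be sketched as follows. This is an illustrative model rather than a description of any particular hardware: the class and function names are hypothetical, pages are assumed to be 4 KB, and the table has only eight entries instead of one per possible virtual page.

```python
from dataclasses import dataclass

PAGE_OFFSET_BITS = 12   # assumption: 4 KB pages


class PageFault(Exception):
    """Raised when the valid bit of the selected entry is off."""


@dataclass
class PageTableEntry:
    valid: bool = False
    physical_page_number: int = 0


def translate(page_table: list[PageTableEntry], virtual_address: int) -> int:
    """Index the page table with the virtual page number; no tags are needed."""
    virtual_page_number = virtual_address >> PAGE_OFFSET_BITS
    page_offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    entry = page_table[virtual_page_number]
    if not entry.valid:
        # The page is not in main memory; the operating system must take over.
        raise PageFault(virtual_page_number)
    return (entry.physical_page_number << PAGE_OFFSET_BITS) | page_offset


page_table = [PageTableEntry() for _ in range(8)]
page_table[3] = PageTableEntry(valid=True, physical_page_number=0x2A)
print(hex(translate(page_table, 0x3ABC)))   # 0x2aabc
```

In this model the `page_table` list plays the role of the memory region that the page table register points to.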



Page Faults 

If the valid bit for a virtual page is off, a page fault occurs. The operating system 
must be given control. This transfer is done with the exception mechanism, which 
we discuss later in this section. Once the operating system gets control, it must 
find the page in the next level of the hierarchy (usually magnetic disk) and decide 
where to place the requested page in main memory. 

The virtual address alone does not immediately tell us where the page is on 
disk. Returning to our library analogy, we cannot find the location of a library 
book on the shelves just by knowing its title. Instead, we go to the catalog and look 
up the book, obtaining an address for the location on the shelves, such as the 
Library of Congress call number. Likewise, in a virtual memory system, we must 
keep track of the location on disk of each page in virtual address space. 

FIGURE 7.21 The page table is indexed with the virtual page number to obtain the corresponding portion of the physical 
address. The starting address of the page table is given by the page table pointer. In this figure, the page size is $2^{12}$ bytes, or 4 KB. The virtual address 
space is $2^{32}$ bytes, or 4 GB, and the physical address space is $2^{30}$ bytes, which allows main memory of up to 1 GB. The number of entries in the page 
table is $2^{20}$, or 1 million entries. The valid bit for each entry indicates whether the mapping is legal. If it is off, then the page is not present in memory. 
Although the page table entry shown here need only be 19 bits wide, it would typically be rounded up to 32 bits for ease of indexing. The extra bits 
would be used to store additional information that needs to be kept on a per-page basis, such as protection. 



Because we do not know ahead of time when a page in memory will be chosen 
to be replaced, the operating system usually creates the space on disk for all the 
pages of a process when it creates the process. This disk space is called the swap 
space. At that time, it also creates a data structure to record where each virtual 
page is stored on disk. This data structure may be part of the page table or may be 
an auxiliary data structure indexed in the same way as the page table. Figure 7.22 
shows the organization when a single table holds either the physical page number 
or the disk address. 



swap space The space on the 
disk reserved for the full virtual 
memory space of a process. 

FIGURE 7.22 The page table maps each page in virtual memory to either a page in 
main memory or a page stored on disk, which is the next level in the hierarchy. The virtual 
page number is used to index the page table. If the valid bit is on, the page table supplies the physical 
page number (i.e., the starting address of the page in memory) corresponding to the virtual page. If the 
valid bit is off, the page currently resides only on disk, at a specified disk address. In many systems, the 
table of physical page addresses and disk page addresses, while logically one table, is stored in two separate 
data structures. Dual tables are justified in part because we must keep the disk addresses of all the 
pages, even if they are currently in main memory. Remember that the pages in main memory and the 
pages on disk are identical in size. 

The operating system also creates a data structure that tracks which processes 
and which virtual addresses use each physical page. When a page fault occurs, if all 
the pages in main memory are in use, the operating system must choose a page to 
replace. Because we want to minimize the number of page faults, most operating 
systems try to choose a page that they hypothesize will not be needed in the near 
future. Using the past to predict the future, operating systems follow the least 
recently used (LRU) replacement scheme, which we mentioned in Section 7.3. 
The operating system searches for the least recently used page, making the 
assumption that a page that has not been used in a long time is less likely to be 
needed than a more recently accessed page. The replaced pages are written to swap 
space on the disk. In case you are wondering, the operating system is just another 
process, and these tables controlling memory are in memory; the details of this 
seeming contradiction will be explained shortly. 

For example, suppose the page references (in order) were 10, 12, 9, 7, 11, 10, 
and then we referenced page 8, which was not present in memory. The LRU page 
is 12; in LRU replacement, we would replace page 12 in main memory with page 
8. If the next reference also generated a page fault, we would replace page 9, since 
it would then be the LRU among the pages present in memory. 
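
The example above can be replayed with a small simulation of exact LRU replacement. The sketch assumes main memory holds exactly five pages (the five distinct pages referenced before the fault), and it uses page 13 as a hypothetical second faulting reference; neither detail is stated in the text.

```python
from collections import OrderedDict


def simulate_lru(references: list[int], num_frames: int) -> list[int]:
    """Return the pages evicted, in order, under exact LRU replacement."""
    resident = OrderedDict()          # oldest use first, most recent use last
    evicted = []
    for page in references:
        if page in resident:
            resident.move_to_end(page)        # hit: now the most recently used
            continue
        if len(resident) == num_frames:       # page fault with memory full
            victim, _ = resident.popitem(last=False)
            evicted.append(victim)
        resident[page] = True
    return evicted


# References from the text, then page 8, then a hypothetical faulting page 13.
print(simulate_lru([10, 12, 9, 7, 11, 10, 8, 13], num_frames=5))   # [12, 9]
```

Moving a page to the end of the ordering on every hit is precisely the per-reference bookkeeping that makes exact LRU costly for an operating system to maintain.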



Implementing a completely accurate LRU scheme is too expensive, since it 
requires updating a data structure on every memory reference. Instead, most 
operating systems approximate LRU by keeping track of which pages have and 
which pages have not been recently used. To help the operating system estimate 
the LRU pages, some computers provide a use bit or reference bit, which is set 
whenever a page is accessed. The operating system periodically clears the refer- 
ence bits and later records them so it can determine which pages were touched 
during a particular time period. With this usage information, the operating sys- 
tem can select a page that is among the least recently referenced (detected by hav- 
ing its reference bit off). If this bit is not provided by the hardware, the operating 
system must find another way to estimate which pages have been accessed. 
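
A toy model of the reference-bit approximation might look like the sketch below. The class and method names are hypothetical, and real operating systems use more elaborate policies; the point is only that a single bit per page, cleared periodically, is enough to identify pages that have not been touched recently.

```python
class ReferenceBitTracker:
    """Toy model of reference (use) bits for a fixed set of resident pages."""

    def __init__(self, resident_pages):
        self.reference_bit = {page: False for page in resident_pages}

    def access(self, page):
        # On real hardware this bit is set automatically on every access.
        self.reference_bit[page] = True

    def end_interval(self):
        """Record which pages were touched this interval, then clear all bits."""
        touched = {page for page, bit in self.reference_bit.items() if bit}
        for page in self.reference_bit:
            self.reference_bit[page] = False
        return touched

    def pick_victim(self):
        """Return a page whose reference bit is off, or None if all bits are on."""
        for page, bit in self.reference_bit.items():
            if not bit:
                return page
        return None


tracker = ReferenceBitTracker([10, 12, 9, 7, 11])
for page in (10, 9, 7, 11):
    tracker.access(page)
print(tracker.pick_victim())   # 12, the only page not touched this interval
```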



Hardware/Software Interface 

reference bit Also called use 
bit. A field that is set whenever a 
page is accessed and that is used 
to implement LRU or other 
replacement schemes. 



Elaboration: With a 32-bit virtual address, 4 KB pages, and 4 bytes per page table 
entry, we can compute the total page table size: 



$$\text{Number of page table entries} = \frac{2^{32}}{2^{12}} = 2^{20}$$

$$\text{Size of page table} = 2^{20}\ \text{page table entries} \times 2^{2}\ \frac{\text{bytes}}{\text{page table entry}} = 4\ \text{MB}$$



That is, we would need to use 4 MB of memory for each program in execution at any 
time. On a computer with tens to hundreds of active programs and a fixed-size page 
table, most or all of the memory would be tied up in page tables! 
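
The same calculation can be repeated for other configurations with a short helper. The function name is hypothetical, and the 100-process figure is an assumption used only to illustrate the "tens to hundreds of active programs" remark.

```python
def page_table_size_bytes(virtual_address_bits: int,
                          page_size_bytes: int,
                          entry_size_bytes: int) -> int:
    """Size of a flat page table with one entry per virtual page."""
    num_entries = (1 << virtual_address_bits) // page_size_bytes
    return num_entries * entry_size_bytes


MB = 1 << 20
per_process = page_table_size_bytes(32, 4 * 1024, 4)
print(per_process // MB, "MB per process")                     # 4 MB per process
# Hypothetical load of 100 active processes, each with a full flat table.
print(100 * per_process // MB, "MB for 100 processes")         # 400 MB for 100 processes
```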

A range of techniques is used to reduce the amount of storage required for the page 
table. The five techniques below aim at reducing the total maximum storage required as 
well as minimizing the main memory dedicated to page tables: 



1. The simplest technique is to keep a limit register that restricts the size of the page 
table for a given process. If the virtual page number becomes larger than the con- 
tents of the limit register, entries must be added to the page table. This technique 






allows the page table to grow as a process consumes more space. Thus, the page 
table will only be large if the process is using many pages of virtual address space. 
This technique requires that the address space expand in only one direction. 

2. Allowing growth in only one direction is not sufficient, since most languages re- 
quire two areas whose size is expandable: one area holds the stack and the other 
area holds the heap. Because of this duality, it is convenient to divide the page 
table and let it grow from the highest address down, as well as from the lowest 
address up. This means that there will be two separate page tables and two sep- 
arate limits. The use of two page tables breaks the address space into two seg- 
ments. The high-order bit of an address usually determines which segment and 
thus which page table to use for that address. Since the segment is specified by 
the high-order address bit, each segment can be as large as one-half of the ad- 
dress space. A limit register for each segment specifies the current size of the seg- 
ment, which grows in units of pages. This type of segmentation is used by many 
architectures, including MIPS. Unlike the type of segmentation discussed in the 
Elaboration on page 514, this form of segmentation is invisible to the application 
program, although not to the operating system. The major disadvantage of this 
scheme is that it does not work well when the address space is used in a sparse 
fashion rather than as a contiguous set of virtual addresses. 

3. Another approach to reducing the page table size is to apply a hashing function to 
the virtual address so that the page table data structure need be only the size of 
the number of physical pages in main memory. Such a structure is called an invert- 
ed page table. Of course, the lookup process is slightly more complex with an in- 
verted page table because we can no longer just index the page table. 

4. Multiple levels of page tables can also be used to reduce the total amount of page 
table storage. The first level maps large fixed-size blocks of virtual address space, 
perhaps 64 to 256 pages in total. These large blocks are sometimes called seg- 
ments, and this first-level mapping table is sometimes called a segment table, 
though the segments are invisible to the user. Each entry in the segment table in- 
dicates whether any pages in that segment are allocated and, if so, points to a 
page table for that segment. Address translation happens by first looking in the 
segment table, using the highest-order bits of the address. If the segment address 
is valid, the next set of high-order bits is used to index the page table indicated by 
the segment table entry. This scheme allows the address space to be used in a 
sparse fashion (multiple noncontiguous segments can be active) without having to 
allocate the entire page table. Such schemes are particularly useful with very large 
address spaces and in software systems that require noncontiguous allocation. 
The primary disadvantage of this two-level mapping is the more complex process 
for address translation. 

5. To reduce the actual main memory tied up in page tables, most modern systems 
also allow the page tables to be paged. Although this sounds tricky, it works by 
using the same basic ideas of virtual memory and simply allowing the page tables 
to reside in the virtual address space. In addition, there are some small but critical
problems, such as a never-ending series of page faults, which must be avoided.
How these problems are overcome is both very detailed and typically highly pro- 
cessor specific. In brief, these problems are avoided by placing all the page tables 
in the address space of the operating system and placing at least some of the 
page tables for the system in a portion of main memory that is physically ad- 
dressed and is always present and thus never on disk. 
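
As an illustration of technique 4, the following is a minimal sketch of a two-level lookup. The field widths (a 12-bit page offset, a 10-bit page index, and a 10-bit segment number), the dictionary representation, and the names `translate` and `PageFault` are assumptions made for this example; they are not taken from any particular machine.

```python
# Hypothetical layout of a 32-bit virtual address:
#   [ 10-bit segment number | 10-bit page index | 12-bit page offset ]
OFFSET_BITS = 12
PAGE_INDEX_BITS = 10


class PageFault(Exception):
    """Raised when no valid mapping exists for a virtual address."""


def translate(segment_table, vaddr):
    """Translate vaddr through a segment table and a per-segment page table.

    segment_table maps a segment number to a page table; each page table
    maps a page index to a physical page number. A missing key stands for
    an invalid entry, so unused segments consume no page table storage.
    """
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    page_index = (vaddr >> OFFSET_BITS) & ((1 << PAGE_INDEX_BITS) - 1)
    segment = vaddr >> (OFFSET_BITS + PAGE_INDEX_BITS)

    page_table = segment_table.get(segment)      # first level: highest-order bits
    if page_table is None:
        raise PageFault(f"segment {segment} has no page table")
    ppn = page_table.get(page_index)             # second level: next set of bits
    if ppn is None:
        raise PageFault(f"page {page_index} of segment {segment} is not mapped")
    return (ppn << OFFSET_BITS) | offset


# Two noncontiguous segments are active; the other 1022 need no page table.
segment_table = {0: {0: 0x2A, 1: 0x2B}, 1023: {1023: 0x7F}}
print(hex(translate(segment_table, 0x00001234)))
print(hex(translate(segment_table, 0xFFFFF010)))
```

The sketch shows both the benefit (sparse use of the address space) and the cost (two table accesses per translation instead of one).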

What about Writes? 

The difference between the access time to the cache and main memory is tens to 
hundreds of cycles, and write-through schemes can be used, although we need a 
write buffer to hide the latency of the write from the processor. In a virtual mem- 
ory system, writes to the next level of the hierarchy (disk) take millions of proces- 
sor clock cycles; therefore, building a write buffer to allow the system to write 
through to disk would be completely impractical. Instead, virtual memory sys- 
tems must use write-back, performing the individual writes into the page in 
memory and copying the page back to disk when it is replaced in the memory. 
This copying back to the lower level in the hierarchy is the source of the other 
name for this technique of handling writes, namely, copy back. 



A write-back scheme has another major advantage in a virtual memory system. 
Because the disk transfer time is small compared with its access time, copying 
back an entire page is much more efficient than writing individual words back to 
the disk. A write-back operation, although more efficient than transferring indi- 
vidual words, is still costly. Thus, we would like to know whether a page needs to 
be copied back when we choose to replace it. To track whether a page has been 
written since it was read into the memory, a dirty bit is added to the page table. 
The dirty bit is set when any word in a page is written. If the operating system 
chooses to replace the page, the dirty bit indicates whether the page needs to be 
written out before its location in memory can be given to another page. 
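
The role of the dirty bit can be sketched in a few lines. This is only an illustrative model: the class `PageTableEntry`, the functions `write_word` and `replace_page`, and the use of dictionaries for memory and disk are assumptions made for the example.

```python
class PageTableEntry:
    """Hypothetical page table entry holding a physical page and a dirty bit."""

    def __init__(self, physical_page):
        self.physical_page = physical_page
        self.dirty = False


def write_word(entry, memory, offset, value):
    """Write one word into the page in memory and set the dirty bit."""
    memory[entry.physical_page][offset] = value
    entry.dirty = True


def replace_page(entry, memory, disk, virtual_page):
    """Evict a page; copy it back to disk only if it was written.

    Returns True when a copy back was needed.
    """
    if entry.dirty:
        disk[virtual_page] = list(memory[entry.physical_page])  # whole page at once
        entry.dirty = False
        return True
    return False


memory = {3: [0, 0, 0, 0]}      # physical page 3, four words for brevity
disk = {}
entry = PageTableEntry(physical_page=3)

print(replace_page(entry, memory, disk, virtual_page=9))  # clean page: no disk write
write_word(entry, memory, 2, 42)
print(replace_page(entry, memory, disk, virtual_page=9))  # dirty page: copied back
```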



Hardware Software Interface



Making Address Translation Fast: The TLB 

Since the page tables are stored in main memory, every memory access by a program 
can take at least twice as long: one memory access to obtain the physical address and 
a second access to get the data. The key to improving access performance is to rely on 
locality of reference to the page table. When a translation for a virtual page number is 
used, it will probably be needed again in the near future because the references to the 
words on that page have both temporal and spatial locality. 






translation-lookaside buffer (TLB): A cache that keeps track of recently used address mappings to avoid an access to the page table.



Accordingly, modern processors include a special cache that keeps track of 
recently used translations. This special address translation cache is traditionally 
referred to as a translation-lookaside buffer (TLB), although it would be more 
accurate to call it a translation cache. The TLB corresponds to that little piece of 
paper we typically use to record the location of a set of books we look up in the 
card catalog; rather than continually searching the entire catalog, we record the 
location of several books and use the scrap of paper as a cache of Library of Con- 
gress call numbers. 

Figure 7.23 shows that each tag entry in the TLB holds a portion of the virtual 
page number, and each data entry of the TLB holds a physical page number. Because 
we will no longer access the page table on every reference, instead accessing the TLB, 
the TLB will need to include other bits, such as the dirty and the reference bit.

FIGURE 7.23 The TLB acts as a cache on the page table for the entries that map to physical pages only. The TLB contains a subset of the virtual-to-physical page mappings that are in the page table. The TLB mappings are shown in color. Because the TLB is a cache, it must have a tag field. If there is no matching entry in the TLB for a page, the page table must be examined. The page table either supplies a physical page number for the page (which can then be used to build a TLB entry) or indicates that the page resides on disk, in which case a page fault occurs. Since the page table has an entry for every virtual page, no tag field is needed; in other words, it is not a cache.

On every reference, we look up the virtual page number in the TLB. If we get a
hit, the physical page number is used to form the address, and the corresponding 
reference bit is turned on. If the processor is performing a write, the dirty bit is 
also turned on. If a miss in the TLB occurs, we must determine whether it is a 
page fault or merely a TLB miss. If the page exists in memory, then the TLB miss 
indicates only that the translation is missing. In such cases, the processor can 
handle the TLB miss by loading the translation from the page table into the TLB 
and then trying the reference again. If the page is not present in memory, then 
the TLB miss indicates a true page fault. In this case, the processor invokes the 
operating system using an exception. Because the TLB has many fewer entries 
than the number of pages in main memory, TLB misses will be much more fre- 
quent than true page faults. 
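
The sequence just described can be sketched as follows. This is a simplified model under stated assumptions: the TLB is a dictionary with no capacity limit (so replacement is not modeled), a page table value of `None` stands for a page that resides on disk, and the names `access` and `PageFault` are hypothetical.

```python
class PageFault(Exception):
    """Raised when the referenced page is not present in memory."""


def access(tlb, page_table, vpn, is_write=False):
    """Process one reference; return (outcome, physical_page_number)."""
    outcome = "TLB hit"
    if vpn not in tlb:
        ppn = page_table.get(vpn)          # None models a page that is on disk
        if ppn is None:
            raise PageFault(vpn)           # the OS would be invoked by an exception
        # Merely a TLB miss: load the translation, then retry the reference.
        tlb[vpn] = {"ppn": ppn, "ref": False, "dirty": False}
        outcome = "TLB miss"
    entry = tlb[vpn]
    entry["ref"] = True                    # reference bit turned on for every access
    if is_write:
        entry["dirty"] = True              # dirty bit turned on for writes
    return outcome, entry["ppn"]


tlb = {}
page_table = {0: 7, 1: None}               # page 0 in memory, page 1 on disk
print(access(tlb, page_table, 0))
print(access(tlb, page_table, 0, is_write=True))
```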

TLB misses can be handled either in hardware or in software. In practice, with 
care there can be little performance difference between the two approaches 
because the basic operations are the same in either case. 

After a TLB miss occurs and the missing translation has been retrieved from 
the page table, we will need to select a TLB entry to replace. Because the reference 
and dirty bits are contained in the TLB entry, we need to copy these bits back to 
the page table entry when we replace an entry. These bits are the only portion of 
the TLB entry that can be changed. Using write-back — that is, copying these 
entries back at miss time rather than when they are written — is very efficient, 
since we expect the TLB miss rate to be small. Some systems use other techniques 
to approximate the reference and dirty bits, eliminating the need to write into the 
TLB except to load a new table entry on a miss. 

Some typical values for a TLB might be 

■ TLB size: 16-512 entries 

■ Block size: 1-2 page table entries (typically 4-8 bytes each) 

■ Hit time: 0.5-1 clock cycle 

■ Miss penalty: 10-100 clock cycles 

■ Miss rate: 0.01%-1% 
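
To get a feel for what these ranges imply, the sketch below applies the usual average-access-time model (hit time plus miss rate times miss penalty) to the two ends of the ranges listed above. Pairing the endpoints this way is an assumption made for illustration; real designs do not necessarily combine the best or worst value of every parameter.

```python
def average_translation_cycles(hit_time, miss_rate, miss_penalty):
    """Average cycles spent on translation per reference.

    Every reference pays the hit time; a fraction miss_rate of them
    additionally pays the miss penalty.
    """
    return hit_time + miss_rate * miss_penalty


# Endpoints of the typical ranges; miss rates are fractions (0.01% and 1%).
low = average_translation_cycles(hit_time=0.5, miss_rate=0.0001, miss_penalty=10)
high = average_translation_cycles(hit_time=1.0, miss_rate=0.01, miss_penalty=100)
print(f"{low:.3f} to {high:.3f} clock cycles per reference")
```

Even at the unfavorable end, the miss term is comparable to the hit time, which is why keeping the TLB miss rate low matters more than shaving cycles off the miss handler.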

Designers have used a wide variety of associativities in TLBs. Some systems use 
small, fully associative TLBs because a fully associative mapping has a lower miss 
rate; furthermore, since the TLB is small, the cost of a fully associative mapping is 
not too high. Other systems use large TLBs, often with small associativity. With a 
fully associative mapping, choosing the entry to replace becomes tricky since 
implementing a hardware LRU scheme is too expensive. Furthermore, since TLB 
misses are much more frequent than page faults and thus must be handled more 
cheaply, we cannot afford an expensive software algorithm, as we can for page
faults. As a result, many systems provide some support for randomly choosing an
entry to replace. We'll examine replacement schemes in a little more detail in 
Section 7.5. 

The Intrinsity FastMATH TLB 

To see these ideas in a real processor, let's take a closer look at the TLB of the 
Intrinsity FastMATH. The memory system uses 4 KB pages and a 32-bit address 
space; thus, the virtual page number is 20 bits long, as in the top of Figure 7.24. 
The physical address is the same size as the virtual address. The TLB contains 16 
entries, is fully associative, and is shared between the instruction and data refer- 
ences. Each entry is 64 bits wide and contains a 20-bit tag (which is the virtual 
page number for that TLB entry), the corresponding physical page number (also 
20 bits), a valid bit, a dirty bit, and other bookkeeping bits. 
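
The field widths above follow directly from the page size and address size, as the following sketch shows. The entry layout (a dictionary with `valid`, `tag`, and `ppn` keys) and the function names are assumptions made for this example and do not describe the actual 64-bit entry format.

```python
ADDRESS_BITS = 32
PAGE_BYTES = 4 * 1024
OFFSET_BITS = PAGE_BYTES.bit_length() - 1    # 12 bits of page offset
VPN_BITS = ADDRESS_BITS - OFFSET_BITS        # 20 bits of virtual page number


def split_virtual_address(vaddr):
    """Return (virtual page number, page offset) for a 32-bit address."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_BYTES - 1)


def fully_associative_lookup(tlb_entries, vpn):
    """Compare vpn against the tag of every valid entry, as a fully
    associative TLB must; return the physical page number or None."""
    for entry in tlb_entries:
        if entry["valid"] and entry["tag"] == vpn:
            return entry["ppn"]
    return None


# A 16-entry TLB with a single valid mapping (hypothetical contents).
tlb_entries = [{"valid": False, "tag": 0, "ppn": 0} for _ in range(16)]
tlb_entries[5] = {"valid": True, "tag": 0x12345, "ppn": 0x00ABC}

vpn, offset = split_virtual_address(0x12345678)
ppn = fully_associative_lookup(tlb_entries, vpn)
print(VPN_BITS, hex(vpn), hex(offset), hex((ppn << OFFSET_BITS) | offset))
```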

Figure 7.24 shows the TLB and one of the caches, while Figure 7.25 shows the 
steps in processing a read or write request. When a TLB miss occurs, the MIPS 
hardware saves the page number of the reference in a special register and generates 
an exception. The exception invokes the operating system, which handles the miss 
in software. To find the physical address for the missing page, the TLB miss rou- 
tine indexes the page table using the page number of the virtual address and the 
page table register, which indicates the starting address of the active process page 
table. Using a special set of system instructions that can update the TLB, the oper- 
ating system places the physical address from the page table into the TLB. A TLB 
miss takes about 13 clock cycles, assuming the code and the page table entry are in 
the instruction cache and data cache, respectively. (We will see the MIPS TLB
code on page 534.) A true page fault occurs if the page table entry does not have a
valid physical address. The hardware maintains an index that indicates the recom- 
mended entry to replace; the recommended entry is chosen randomly. 

There is an extra complication for write requests: namely, the write access bit in 
the TLB must be checked. This bit prevents the program from writing into pages 
for which it has only read access. If the program attempts a write and the write 
access bit is off, an exception is generated. The write access bit forms part of the 
protection mechanism, which we discuss shortly. 

Integrating Virtual Memory, TLBs, and Caches 

Our virtual memory and cache systems work together as a hierarchy, so that data 
cannot be in the cache unless it is present in main memory. The operating system 
plays an important role in maintaining this hierarchy by flushing the contents of
any page from the cache when it decides to migrate that page to disk. At the same
time, the OS modifies the page tables and TLB, so that an attempt to access any 
data on the page will generate a page fault. 

Under the best of circumstances, a virtual address is translated by the TLB and
sent to the cache where the appropriate data is found, retrieved, and sent back to
the processor. In the worst case, a reference can miss in all three components of
the memory hierarchy: the TLB, the page table, and the cache. The following
example illustrates these interactions in more detail.

FIGURE 7.24 The TLB and cache implement the process of going from a virtual address to a data item in the Intrinsity FastMATH. This figure shows the organization of the TLB and the data cache assuming a 4 KB page size. This diagram focuses on a read; Figure 7.25 describes how to handle writes. Note that unlike Figure 7.9 on page 486, the tag and data RAMs are split. By addressing the long but narrow data RAM with the cache index concatenated with the block offset, we select the desired word in the block without a 16:1 multiplexor. While the cache is direct mapped, the TLB is fully associative. Implementing a fully associative TLB requires that every TLB tag be compared against the virtual page number, since the entry of interest can be anywhere in the TLB. If the valid bit of the matching entry is on, the access is a TLB hit, and bits from the physical page number together with bits from the page offset form the index that is used to access the cache. (The Intrinsity actually has a 16 KB page size; the Elaboration on page 528 explains how it works.)

[Figure 7.25 flowchart. Recoverable box labels: Virtual address; TLB miss exception; Physical address; Try to read data from cache; Cache miss stall while read block; Deliver data to the CPU; Write protection exception; Try to write data to cache; Cache miss stall while read block; Write data into cache, update the dirty bit, and put the data and the address into the write buffer.]

FIGURE 7.25 Processing a read or a write-through in the Intrinsity FastMATH TLB and cache. If the TLB generates a hit, the cache can be accessed with the resulting physical address. For a read, the cache generates a hit or miss and supplies the data or causes a stall while the data is brought from memory. If the operation is a write, a portion of the cache entry is overwritten for a hit and the data is sent to the write buffer if we assume write-through. A write miss is just like a read miss except that the block is modified after it is read from memory. Write-back requires writes to set a dirty bit for the cache block, and a write buffer is loaded with the whole block only on a read miss or write miss if the block to be replaced is dirty. Notice that a TLB hit and a cache hit are independent events, but a cache hit can only occur after a TLB hit occurs, which means that the data must be present in memory. The relationship between TLB misses and cache misses is examined further in the following example and the exercises at the end of this chapter.



Overall Operation of a Memory Hierarchy

EXAMPLE

In a memory hierarchy like that of Figure 7.24 that includes a TLB and a
cache organized as shown, a memory reference can encounter three different
types of misses: a TLB miss, a page fault, and a cache miss. Consider all the
combinations of these three events with one or more occurring (seven possibilities).
For each possibility, state whether this event can actually occur and
under what circumstances.

ANSWER

Figure 7.26 shows the possible circumstances and whether they can arise in
practice or not.
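
The reasoning behind Figure 7.26 reduces to two rules: a translation cannot be in the TLB unless the page is in memory, and data cannot be in the cache unless the page is in memory. The following sketch enumerates the seven combinations under those two rules; the function name `classify` and the boolean encoding (True for a hit) are assumptions made for this example.

```python
from itertools import product


def classify(tlb_hit, page_table_hit, cache_hit):
    """Apply the two inclusion rules to one combination of events."""
    if tlb_hit and not page_table_hit:
        return "impossible: no translation in TLB if page is not in memory"
    if cache_hit and not page_table_hit:
        return "impossible: no data in cache if page is not in memory"
    return "possible"


def name(hit):
    return "hit" if hit else "miss"


impossible = 0
for tlb_hit, pt_hit, cache_hit in product([True, False], repeat=3):
    if tlb_hit and pt_hit and cache_hit:
        continue                      # no miss at all; not one of the seven cases
    verdict = classify(tlb_hit, pt_hit, cache_hit)
    impossible += verdict.startswith("impossible")
    print(f"TLB {name(tlb_hit):4}  page table {name(pt_hit):4}  "
          f"cache {name(cache_hit):4}  {verdict}")

print(impossible, "of the seven combinations are impossible")
```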



TLB    Page table   Cache   Possible? If so, under what circumstance?
hit    hit          miss    Possible, although the page table is never really checked if TLB hits.
miss   hit          hit     TLB misses, but entry found in page table; after retry, data is found in cache.
miss   hit          miss    TLB misses, but entry found in page table; after retry, data misses in cache.
miss   miss         miss    TLB misses and is followed by a page fault; after retry, data must miss in cache.
hit    miss         miss    Impossible: cannot have a translation in TLB if page is not present in memory.
hit    miss         hit     Impossible: cannot have a translation in TLB if page is not present in memory.
miss   miss         hit     Impossible: data cannot be allowed in cache if the page is not in memory.

FIGURE 7.26 The possible combinations of events in the TLB, virtual memory system, and cache. Three of these combinations are impossible, and one is possible (TLB hit, virtual memory hit, cache miss) but never detected.

Elaboration: Figure 7.26 assumes that all memory addresses are translated to
physical addresses before the cache is accessed. In this organization, the cache is
physically indexed and physically tagged (both the cache index and tag are physical,
rather than virtual, addresses). In such a system, the amount of time to access memory,
assuming a cache hit, must accommodate both a TLB access and a cache access;
of course, these accesses can be pipelined.

Alternatively, the processor can index the cache with an address that is completely
or partially virtual. This is called a virtually addressed cache, and it uses tags that are
virtual addresses; hence, such a cache is virtually indexed and virtually tagged. In such
caches, the address translation hardware (TLB) is unused during the normal cache
access, since the cache is accessed with a virtual address that has not been translated
to a physical address. This takes the TLB out of the critical path, reducing cache
latency. When a cache miss occurs, however, the processor needs to translate the
address to a physical address so that it can fetch the cache block from main memory.

virtually addressed cache: A cache that is accessed with a virtual address rather than a physical address.

aliasing: A situation in which the same object is accessed by two addresses; can occur in virtual memory when there are two virtual addresses for the same physical page.

physically addressed cache: A cache that is addressed by a physical address.

When the cache is accessed with a virtual address and pages are shared between 
programs (which may access them with different virtual addresses), there is the possi- 
bility of aliasing. Aliasing occurs when the same object has two names — in this case, 
two virtual addresses for the same page. This ambiguity creates a problem because a 
word on such a page may be cached in two different locations, each corresponding to 
different virtual addresses. This ambiguity would allow one program to write the data 
without the other program being aware that the data had changed. Completely virtually 
addressed caches either introduce design limitations on the cache and TLB to reduce 
aliases or require the operating system, and possibly the user, to take steps to ensure 
that aliases do not occur. 

Figure 7.24 assumed a 4 KB page size, but it's really 16 KB. The Intrinsity FastMATH 
uses such a memory system organization. The cache and TLB are still accessed in par- 
allel, so the upper 2 bits of the cache index must be virtual. Hence, up to four cache 
entries could be aliased to the same physical memory address. As the L2 cache on the 
chip includes all entries in the L1 caches, on an L1 miss it checks the other three possible
cache locations in the L2 cache for aliases. If it finds one, it flushes it from the
caches to prevent aliases from occurring. 

A common compromise between these two design points is caches that are virtually 
indexed (sometimes using just the page offset portion of the address, which is really a 
physical address since it is untranslated), but use physical tags. These designs, which 
are virtually indexed but physically tagged, attempt to achieve the performance advantages
of virtually indexed caches with the architecturally simpler advantages of a physically
addressed cache. For example, there is no alias problem in this case. The L1 data
cache of the Pentium 4 is an example, as would be the Intrinsity if the page size were 4 KB.
To pull off this trick, there must be careful coordination between the minimum page 
size, the cache size, and associativity. 
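
The coordination mentioned above can be sketched as a simple check. The cache index and block offset together select a location within one way of the cache, so if one way is no larger than a page, all of those bits come from the untranslated page offset. The function name and the example configurations below are illustrative assumptions, and all sizes are assumed to be powers of two.

```python
def virtual_index_bits(cache_bytes, associativity, page_bytes):
    """Count the cache-index bits that lie above the page offset.

    A result of 0 means the cache can be indexed using only
    untranslated (page offset) bits.
    """
    bytes_per_way = cache_bytes // associativity
    if bytes_per_way <= page_bytes:
        return 0
    return (bytes_per_way // page_bytes).bit_length() - 1


KB = 1024

# Hypothetical 16 KB direct-mapped cache with 4 KB pages.
bits = virtual_index_bits(16 * KB, 1, 4 * KB)
print(bits, "virtual index bits,", 2 ** bits, "candidate locations per physical block")

# Hypothetical 8 KB four-way set-associative cache with 4 KB pages.
print(virtual_index_bits(8 * KB, 4, 4 * KB), "virtual index bits")
```

Raising associativity or the minimum page size, or shrinking the cache, are the three ways to drive the count to zero.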



Elaboration: The FastMATH TLB is a bit more complicated than in Figure 7.24. MIPS 
includes two physical page mappings per virtual page number, thereby mapping an even- 
odd pair of virtual page numbers into two physical page numbers. Hence, the tag is 1 bit 
narrower since each entry corresponds to two pages. The least significant bit of the vir- 
tual page number selects between the two physical pages. There are separate book- 
keeping bits for each physical page. This optimization doubles the amount of memory 
mapped per TLB entry. As the Elaboration on page 530 explains, the tag field actually 
includes an 8-bit address space ID field to reduce the cost of context switches. To sup- 
port the variable page sizes mentioned on page 537, there is also a 32-bit mask field 
that determines the dividing line between the virtual page address and the page offset. 



Implementing Protection with Virtual Memory 

One of the most important functions for virtual memory is to allow sharing of a 
single main memory by multiple processes, while providing memory protection 
among these processes and the operating system. The protection mechanism must
ensure that although multiple processes are sharing the same main memory, one
renegade process cannot write into the address space of another user process or 
into the operating system either intentionally or unintentionally. For example, if 
the program that maintains student grades were running on a computer at the 
same time as the programs of the students in the first programming course, we 
wouldn't want the errant program of a beginner to write over someone's grades. 
The write access bit in the TLB can protect a page from being written. Without 
this level of protection, computer viruses would be even more widespread. 



To enable the operating system to implement protection in the virtual memory sys- 
tem, the hardware must provide at least the three basic capabilities summarized below. 

1. Support at least two modes that indicate whether the running process is a 
user process or an operating system process, variously called a supervisor 
process, a kernel process, or an executive process. 

2. Provide a portion of the processor state that a user process can read but not 
write. This includes the user/supervisor mode bit, which dictates whether 
the processor is in user or supervisor mode, the page table pointer, and the 
TLB. To write these elements the operating system uses special instructions 
that are only available in supervisor mode. 

3. Provide mechanisms whereby the processor can go from user mode to 
supervisor mode, and vice versa. The first direction is typically accom- 
plished by a system call exception, implemented as a special instruction 
(syscall in the MIPS instruction set) that transfers control to a dedicated 
location in supervisor code space. As with any other exception, the program 
counter from the point of the system call is saved in the exception PC (EPC), 
and the processor is placed in supervisor mode. To return to user mode 
from the exception, use the return from exception (ERET) instruction, which 
resets to user mode and jumps to the address in EPC. 

By using these mechanisms and storing the page tables in the operating system's
address space, the operating system can change the page tables while preventing
a user process from changing them, ensuring that a user process can
access only the storage provided to it by the operating system. 



Hardware Software Interface

kernel mode: Also called supervisor mode. A mode indicating that a running process is an operating system process.

system call: A special instruction that transfers control from user mode to a dedicated location in supervisor code space, invoking the exception mechanism in the process.



We also want to prevent a process from reading the data of another process. 
For example, we wouldn't want a student program to read the grades while they 
were in the processor's memory. Once we begin sharing main memory, we must
provide the ability for a process to protect its data from both reading and writ- 
ing by another process; otherwise, sharing the main memory will be a mixed 
blessing!

Remember that each process has its own virtual address space. Thus, if the
operating system keeps the page tables organized so that the independent virtual 
pages map to disjoint physical pages, one process will not be able to access 
another's data. Of course, this also requires that a user process be unable to change 
the page table mapping. The operating system can assure safety if it prevents the 
user process from modifying its own page tables. Yet, the operating system must 
be able to modify the page tables. Placing the page tables in the protected address 
space of the operating system satisfies both requirements. 

When processes want to share information in a limited way, the operating sys- 
tem must assist them, since accessing the information of another process requires 
changing the page table of the accessing process. The write access bit can be used 
to restrict the sharing to just read sharing, and, like the rest of the page table, this 
bit can be changed only by the operating system. To allow another process, say P1,
to read a page owned by process P2, P2 would ask the operating system to create a
page table entry for a virtual page in P1's address space that points to the same
physical page that P2 wants to share. The operating system could use the write
protection bit to prevent P1 from writing the data, if that was P2's wish. Any bits
that determine the access rights for a page must be included in both the page table 
and the TLB because the page table is accessed only on a TLB miss. 



context switch: A changing of the internal state of the processor to allow a different process to use the processor that includes saving the state needed to return to the currently executing process.



Elaboration: When the operating system decides to change from running process P1
to running process P2 (called a context switch or process switch), it must ensure that P2
cannot get access to the page tables of P1 because that would compromise protection. If
there is no TLB, it suffices to change the page table register to point to P2's page table
(rather than to P1's); with a TLB, we must clear the TLB entries that belong to P1, both to
protect the data of P1 and to force the TLB to load the entries for P2. If the process
switch rate were high, this could be quite inefficient. For example, P2 might load only a
few TLB entries before the operating system switched back to P1. Unfortunately, P1
would then find that all its TLB entries were gone and would have to pay TLB misses to
reload them. This problem arises because the virtual addresses used by P1 and P2 are
the same, and we must clear out the TLB to avoid confusing these addresses.

A common alternative is to extend the virtual address space by adding a process 
identifier or task identifier. The Intrinsity FastMATH has an 8-bit address space ID (ASID) 
field for this purpose. This small field identifies the currently running process; it is kept 
in a register loaded by the operating system when it switches processes. The process 
identifier is concatenated to the tag portion of the TLB, so that a TLB hit occurs only if 
both the page number and the process identifier match. This combination eliminates 
the need to clear the TLB, except on rare occasions. 
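
A minimal sketch of this idea follows. The class `AsidTLB`, its method names, and the dictionary keyed by (identifier, page number) are assumptions made for illustration; they model only the tag match, not capacity, replacement, or the reuse of identifiers.

```python
class AsidTLB:
    """Hypothetical TLB whose tag is the pair (process identifier, page number)."""

    def __init__(self):
        self.entries = {}        # (asid, vpn) -> physical page number
        self.current_asid = 0    # register loaded by the OS on a process switch

    def context_switch(self, asid):
        """Switch processes without clearing any TLB entries."""
        self.current_asid = asid

    def insert(self, vpn, ppn):
        self.entries[(self.current_asid, vpn)] = ppn

    def lookup(self, vpn):
        """Hit only if both the page number and the identifier match."""
        return self.entries.get((self.current_asid, vpn))


tlb = AsidTLB()
tlb.context_switch(1)            # P1 runs and maps virtual page 5
tlb.insert(5, 0x10)

tlb.context_switch(2)            # P2 runs: the same virtual page misses
print(tlb.lookup(5))
tlb.insert(5, 0x20)

tlb.context_switch(1)            # back to P1: its entry is still present
print(hex(tlb.lookup(5)))
```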

Similar problems can occur for a cache, since on a process switch the cache will 
contain data from the running process. These problems arise in different ways for phys- 
ically addressed and virtually addressed caches, and a variety of different solutions, 
such as process identifiers, are used to ensure that a process gets its own data.

Handling TLB Misses and Page Faults

Although the translation of virtual to physical addresses with a TLB is straightforward
when we get a TLB hit, handling TLB misses and page faults is more complex.
A TLB miss occurs when no entry in the TLB matches a virtual address. A
TLB miss can indicate one of two possibilities:

1. The page is present in memory, and we need only create the missing TLB entry.

2. The page is not present in memory, and we need to transfer control to the 
operating system to deal with a page fault. 

How do we know which of these two circumstances has occurred? When we pro- 
cess the TLB miss, we will look for a page table entry to bring into the TLB. If the 
matching page table entry has a valid bit that is turned off, then the corresponding 
page is not in memory and we have a page fault, rather than just a TLB miss. If the 
valid bit is on, we can simply retrieve the desired entry. 

A TLB miss can be handled in software or hardware because it will require only 
a short sequence of operations to copy a valid page table entry from memory into 
the TLB. MIPS traditionally handles a TLB miss in software. It brings in the page 
table entry from memory and then reexecutes the instruction that caused the TLB 
miss. Upon reexecuting it will get a TLB hit. If the page table entry indicates the 
page is not in memory, this time it will get a page fault exception. 

Handling a TLB miss or a page fault requires using the exception mechanism to 
interrupt the active process, transferring control to the operating system, and later 
resuming execution of the interrupted process. A page fault will be recognized 
sometime during the clock cycle used to access memory. To restart the instruction 
after the page fault is handled, the program counter of the instruction that caused 
the page fault must be saved. Just as in Chapters 5 and 6, the exception program 
counter (EPC) is used to hold this value. 

In addition, a TLB miss or page fault exception must be asserted by the end of 
the same clock cycle that the memory access occurs, so that the next clock cycle 
will begin exception processing rather than continue normal instruction execu- 
tion. If the page fault was not recognized in this clock cycle, a load instruction 
could overwrite a register, and this could be disastrous when we try to restart the 
instruction. For example, consider the instruction lw $1,0($1): the computer 
must be able to prevent the write pipeline stage from occurring; otherwise, it 
could not properly restart the instruction, since the contents of $1 would have
been destroyed. A similar complication arises on stores. We must prevent the 
write into memory from actually completing when there is a page fault; this is 
usually done by deasserting the write control line to the memory.

Register   CP0 register number   Description
EPC        14                    Where to restart after exception
Cause      13                    Cause of exception
BadVAddr   8                     Address that caused exception
Index      0                     Location in TLB to be read or written
Random     1                     Pseudorandom location in TLB
EntryLo    2                     Physical page address and flags
EntryHi    10                    Virtual page address
Context    4                     Page table address and page number

FIGURE 7.27 MIPS control registers. These are considered to be in coprocessor 0, and hence are
read using mfc0 and written using mtc0.



Hardware Software Interface 



exception enable Also called 
interrupt enable. A signal or 
action that controls whether the 
process responds to an excep- 
tion or not; necessary for pre- 
venting the occurrence of 
exceptions during intervals 
before the processor has safely 
saved the state needed to restart. 



Between the time we begin executing the exception handler in the operating sys- 
tem and the time that the operating system has saved all the state of the process, 
the operating system is particularly vulnerable. For example, if another excep- 
tion occurred when we were processing the first exception in the operating sys- 
tem, the control unit would overwrite the exception program counter, making it 
impossible to return to the instruction that caused the page fault! We can avoid 
this disaster by providing the ability to disable and enable exceptions. When an 
exception first occurs, the processor sets a bit that disables all other exceptions; 
this could happen at the same time the processor sets the supervisor mode bit. 
The operating system will then save just enough state to allow it to recover if 
another exception occurs — namely, the exception program counter and Cause 
register. EPC and Cause are two of the special control registers that help with 
exceptions, TLB misses, and page faults; Figure 7.27 shows the rest. The operating 
system can then reenable exceptions. These steps make sure that exceptions will 
not cause the processor to lose any state and thereby be unable to restart execution 
of the interrupted instruction. 
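
The hazard can be made concrete with a toy model. The sketch below is not MIPS hardware: the class name ToyCPU and its fields are invented for illustration, and deferring a second exception while exceptions are disabled is a simplifying assumption. It shows only why the EPC must be protected until the operating system has saved it.

```python
class ToyCPU:
    """Toy model of EPC, Cause, and an exception-enable bit (hypothetical names)."""

    def __init__(self, disable_on_entry):
        self.epc = None
        self.cause = None
        self.exceptions_enabled = True
        self.disable_on_entry = disable_on_entry
        self.pending = []  # exceptions held off while disabled

    def raise_exception(self, pc, cause):
        """Return True if the exception was taken, False if it was deferred."""
        if not self.exceptions_enabled:
            self.pending.append((pc, cause))  # EPC and Cause stay untouched
            return False
        self.epc, self.cause = pc, cause      # overwrites any earlier values
        if self.disable_on_entry:
            self.exceptions_enabled = False
        return True

    def save_state_and_reenable(self):
        """Model the OS saving EPC and Cause, then reenabling exceptions."""
        saved = (self.epc, self.cause)
        self.exceptions_enabled = True
        return saved


# Without the disable bit, a second exception destroys the restart address.
unsafe = ToyCPU(disable_on_entry=False)
unsafe.raise_exception(0x00400100, "page fault")
unsafe.raise_exception(0x80000180, "interrupt")
print(hex(unsafe.epc))
```

With disable_on_entry=True, the second call is deferred and the EPC still holds the address of the faulting instruction when the operating system saves it.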



Once the operating system knows the virtual address that caused the page fault, 
it must complete three steps: 

1. Look up the page table entry using the virtual address and find the location 
of the referenced page on disk. 

2. Choose a physical page to replace; if the chosen page is dirty, it must be writ- 
ten out to disk before we can bring a new virtual page into this physical page. 

3. Start a read to bring the referenced page from disk into the chosen physical 
page. 
Of course, this last step will take millions of processor clock cycles (so will the second 
if the replaced page is dirty); accordingly, the operating system will usually 
select another process to execute in the processor until the disk access completes. 
Because the operating system has saved the state of the process, it can freely give 
control of the processor to another process. 
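
The three steps above can be sketched as follows. This is a minimal model under stated assumptions, not operating system code: the page table and the physical frames are plain Python dictionaries and lists, the disk is a dictionary keyed by disk address, and choose_victim is a hypothetical callback standing in for the replacement policy.

```python
def handle_page_fault(vpn, page_table, frames, choose_victim, disk):
    """Service a fault on virtual page vpn; return the frame that now holds it."""
    # Step 1: use the page table entry to find the page's location on disk.
    entry = page_table[vpn]
    disk_addr = entry["disk_addr"]

    # Step 2: choose a physical page to replace; write it out first if dirty.
    frame = choose_victim(frames)
    old_vpn = frames[frame]["vpn"]
    if old_vpn is not None:
        if frames[frame]["dirty"]:
            disk[page_table[old_vpn]["disk_addr"]] = frames[frame]["data"]
        page_table[old_vpn]["valid"] = False

    # Step 3: read the referenced page from disk into the chosen frame.
    frames[frame] = {"vpn": vpn, "dirty": False, "data": disk[disk_addr]}
    entry["valid"] = True
    entry["frame"] = frame
    return frame


page_table = {
    0: {"valid": True, "frame": 0, "disk_addr": 100},
    1: {"valid": False, "frame": None, "disk_addr": 200},
}
frames = [{"vpn": 0, "dirty": True, "data": "page 0, modified"}]
disk = {100: "page 0, stale", 200: "page 1"}
handle_page_fault(1, page_table, frames, lambda f: 0, disk)
```

In a real system steps 2 and 3 are disk transfers, and the operating system would run another process while each one is in flight.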

When the read of the page from disk is complete, the operating system can 
restore the state of the process that originally caused the page fault and execute the 
instruction that returns from the exception. This instruction will reset the proces- 
sor from kernel to user mode, as well as restore the program counter. The user 
process then reexecutes the instruction that faulted, accesses the requested page 
successfully, and continues execution. 

Page fault exceptions for data accesses are difficult to implement properly in a 
processor because of a combination of three characteristics: 

1. They occur in the middle of instructions, unlike instruction page faults. 

2. The instruction cannot be completed before handling the exception. 

3. After handling the exception, the instruction must be restarted as if nothing 
had occurred. 

Making instructions restartable, so that the exception can be handled and the 
instruction later continued, is relatively easy in an architecture like the MIPS. 
Because each instruction writes only one data item and this write occurs at the 
end of the instruction cycle, we can simply prevent the instruction from complet- 
ing (by not writing) and restart the instruction at the beginning. 

For processors with much more complex instructions that may touch many 
memory locations and write many data items, making instructions restartable is 
much harder. Processing one instruction may generate a number of page faults 
in the middle of the instruction. For example, some processors have block move 
instructions that touch thousands of data words. In such processors, instruc- 
tions often cannot be restarted from the beginning, as we do for MIPS instruc- 
tions. Instead, the instruction must be interrupted and later continued 
midstream in its execution. Resuming an instruction in the middle of its execu- 
tion usually requires saving some special state, processing the exception, and 
restoring that special state. Making this work properly requires careful and 
detailed coordination between the exception-handling code in the operating 
system and the hardware. 

Let's look in more detail at MIPS. When a TLB miss occurs, the MIPS hardware 
saves the page number of the reference in a special register called BadVAddr and 
generates an exception. 

restartable instruction An instruction that can resume execution after an exception is 
resolved without the exception's affecting the result of the instruction. 

handler Name of a software routine invoked to "handle" an exception or interrupt. 

The exception invokes the operating system, which handles the miss in software. 
Control is transferred to address 8000 0000hex, the location of the TLB miss handler. 
To find the physical address for the missing page, the TLB miss routine indexes 
the page table using the page number of the virtual address and the page table register, 
which indicates the starting address of the active process page table. To make this 
indexing fast, MIPS hardware places everything you need in the special Context 
register: the upper 12 bits have the address of the base of the page table and the next 
18 bits have the virtual address of the missing page. Each page table entry is one 
word, so the last 2 bits are 0. Thus, the first two instructions copy the Context register 
into the kernel temporary register $k1 and then load the page table entry from 
that address into $k1. Recall that $k0 and $k1 are reserved for the operating system 
to use without saving; a major reason for this convention is to make the TLB miss 
handler fast. Below is the MIPS code for a typical TLB miss handler: 

TLBmiss:
        mfc0  $k1, Context      # copy address of PTE into temp $k1
        lw    $k1, 0($k1)       # put PTE into temp $k1
        mtc0  $k1, EntryLo      # put PTE into special register EntryLo
        tlbwr                   # put EntryLo into TLB entry at Random
        eret                    # return from TLB miss exception

As shown above, MIPS has a special set of system instructions to update the 
TLB. The instruction tlbwr copies from control register EntryLo into the TLB 
entry selected by the control register Random. Random implements random 
replacement, so it is basically a free-running counter. A TLB miss takes about a 
dozen clock cycles. 
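
To see why the Context register makes the handler so short, it helps to model the register as a ready-made address. The sketch below follows the field widths given in the text (12 bits of page table base, 18 bits of page number, 2 zero bits). The function names, the dictionary used for memory, and the list used for the TLB are assumptions made for illustration.

```python
PTE_BASE_BITS = 12  # upper bits: base of the page table
VPN_BITS = 18       # next bits: page number of the missing page
WORD_SHIFT = 2      # each page table entry is one 4-byte word

def make_context(pte_base, vpn):
    """Build the value the hardware would place in Context on a TLB miss."""
    if not 0 <= pte_base < (1 << PTE_BASE_BITS):
        raise ValueError("page table base does not fit in 12 bits")
    if not 0 <= vpn < (1 << VPN_BITS):
        raise ValueError("page number does not fit in 18 bits")
    return (pte_base << (VPN_BITS + WORD_SHIFT)) | (vpn << WORD_SHIFT)

def tlb_refill(context, memory, tlb, random_index):
    """Mirror the handler: load the PTE at Context, write it into the TLB."""
    pte = memory[context]     # lw $k1, 0($k1)
    tlb[random_index] = pte   # mtc0 $k1, EntryLo followed by tlbwr
    return pte

context = make_context(0xC00, 5)
memory = {context: 0x1234}    # a made-up page table entry
tlb = [None] * 4
tlb_refill(context, memory, tlb, random_index=2)
```

Because the hardware has already combined the base and the index, the handler needs no shift or add instructions; note also that the sketch, like the handler, never inspects the valid bit of the entry it loads.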

Note that the TLB miss handler does not check to see if the page table entry is 
valid. Because the exception for TLB entry missing is much more frequent than a 
page fault, the operating system loads the TLB from the page table without exam- 
ining the entry and restarts the instruction. If the entry is invalid, another and dif- 
ferent exception occurs, and the operating system recognizes the page fault. This 
method makes the frequent case of a TLB miss fast, at a slight performance pen- 
alty for the infrequent case of a page fault. 

Once the process that generated the page fault has been interrupted, control 
transfers to 8000 0180hex, a different address than the TLB miss handler. This is the 
general address for exceptions; the TLB miss has a special entry point to lower the 
penalty for a TLB miss. The operating system uses the exception Cause register to 
diagnose the cause of the exception. Because the exception is a page fault, the 
operating system knows that extensive processing will be required. Thus, unlike a 
TLB miss, it saves the entire state of the active process. This state includes all the 
general-purpose and floating-point registers, the page table address register, the 
EPC, and the exception Cause register. Since exception handlers do not usually 
use the floating-point registers, the general entry point does not save them, leav- 
ing that to the few handlers that need them. 

Figure 7.28 sketches the MIPS code of an exception handler. Note that we save 
and restore the state in MIPS code, taking care when we enable and disable excep- 
tions, but we invoke C code to handle the particular exception. 

Save state

Save GPR            addi  $k1, $sp, -XCPSIZE   # save space on stack for state
                    sw    $sp, XCT_SP($k1)     # save $sp on stack
                    sw    $v0, XCT_V0($k1)     # save $v0 on stack
                    ...                        # save $v1, $ai, $si, $ti, ... on stack
                    sw    $ra, XCT_RA($k1)     # save $ra on stack

Save Hi, Lo         mfhi  $v0                  # copy Hi
                    mflo  $v1                  # copy Lo
                    sw    $v0, XCT_HI($k1)     # save Hi value on stack
                    sw    $v1, XCT_LO($k1)     # save Lo value on stack

Save Exception      mfc0  $a0, $cr             # copy cause register
Registers           sw    $a0, XCT_CR($k1)     # save $cr value on stack
                    ...                        # save $v1, ....
                    mfc0  $a3, $sr             # copy Status Register
                    sw    $a3, XCT_SR($k1)     # save $sr on stack

Set sp              move  $sp, $k1             # sp = sp - XCPSIZE

Enable nested exceptions

                    andi  $v0, $a3, MASK1      # $v0 = $sr & MASK1, enable exceptions
                    mtc0  $v0, $sr             # $sr = value that enables exceptions

Call C exception handler

Set $gp             move  $gp, GPINIT          # set $gp to point to heap area

Call C code         move  $a0, $sp             # arg1 = pointer to exception stack
                    jal   xcpt_deliver         # call C code to handle exception

Restoring state

Restore most        move  $at, $sp             # temporary value of $sp
GPR, Hi, Lo         lw    $ra, XCT_RA($at)     # restore $ra from stack
                    ...                        # restore $t0, ..., $a1
                    lw    $a0, XCT_A0($k1)     # restore $a0 from stack

Restore Status      lw    $v0, XCT_SR($at)     # load old $sr from stack
Register            li    $v1, MASK2           # mask to disable exceptions
                    and   $v0, $v0, $v1        # $v0 = $sr & MASK2, disable exceptions
                    mtc0  $v0, $sr             # set Status Register

Exception return

Restore $sp and     lw    $sp, XCT_SP($at)     # restore $sp from stack
rest of GPR         lw    $v0, XCT_V0($at)     # restore $v0 from stack
used as             lw    $v1, XCT_V1($at)     # restore $v1 from stack
temporary           lw    $k1, XCT_EPC($at)    # copy old $epc from stack
registers           lw    $at, XCT_AT($at)     # restore $at from stack

Restore EPC and     mtc0  $k1, $epc            # restore $epc
return              eret  $ra                  # return to interrupted instruction



FIGURE 7.28 MIPS code to save and restore state on an exception. 



The virtual address that caused the fault depends on whether the fault was an 
instruction or data fault. The address of the instruction that generated the fault is in 
the EPC. If it was an instruction page fault, the EPC contains the virtual address of the 
faulting page; otherwise, the faulting virtual address can be computed by examining 
the instruction (whose address is in the EPC) to find the base register and offset field. 
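
As an illustration of the data-fault case, the sketch below decodes a MIPS load or store to recover the address it tried to access. It relies on the standard I-format layout (base register in bits 25-21, signed 16-bit offset in bits 15-0). The function names and the dictionary and list standing in for memory and the register file are assumptions.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def faulting_address(epc, is_instruction_fault, memory, registers):
    """Return the virtual address whose translation caused the page fault."""
    if is_instruction_fault:
        return epc                           # the fetch itself faulted
    instruction = memory[epc]                # fetch the faulting load or store
    base = (instruction >> 21) & 0x1F        # rs field: base register number
    offset = sign_extend16(instruction)      # immediate field
    return (registers[base] + offset) & 0xFFFFFFFF

registers = [0] * 32
registers[10] = 0x10000FF8
memory = {0x00400020: 0x8D420008}            # encoding of lw $2, 8($10)
address = faulting_address(0x00400020, False, memory, registers)
```

This works only because the base register still holds its original value, which is one more reason the hardware must stop the faulting instruction from writing its result.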



Elaboration: This simplified version assumes that the stack pointer (sp) is valid. To 
avoid the problem of a page fault during this low-level exception code, MIPS sets 
aside a portion of its address space that cannot have page faults, called unmapped. 
The operating system places exception entry point code and the exception stack in 
unmapped memory. MIPS hardware translates virtual addresses 8000 0000hex to 
BFFF FFFFhex to physical addresses simply by ignoring the upper bits of the virtual 
address, thereby placing these addresses in the low part of physical memory. Thus, 
the operating system places exception entry points and exception stacks in 
unmapped memory. 

unmapped A portion of the address space that cannot have page faults. 
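
A short sketch of this fixed translation follows. The text says only that the upper bits are ignored; the choice of exactly three bits (the mask 1FFF FFFFhex) is an assumption based on the usual MIPS-32 layout of this region, and the function name is invented.

```python
UNMAPPED_START = 0x80000000
UNMAPPED_END = 0xBFFFFFFF
PHYSICAL_MASK = 0x1FFFFFFF  # assumption: the top 3 address bits are dropped

def translate_unmapped(vaddr):
    """Translate an unmapped virtual address without consulting the TLB."""
    if not UNMAPPED_START <= vaddr <= UNMAPPED_END:
        raise ValueError("address is mapped; it must go through the TLB")
    return vaddr & PHYSICAL_MASK

# The general exception entry point lands near the bottom of physical memory.
entry_point = translate_unmapped(0x80000180)
```

Since no page table entry is involved, a reference in this range can never take a TLB miss or a page fault, which is exactly the property the exception entry code needs.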

Elaboration: The code in Figure 7.28 shows the MIPS-32 exception return sequence. 
MIPS-I uses rfe and jr instead of eret. 

Summary 

Virtual memory is the name for the level of memory hierarchy that manages cach- 
ing between the main memory and disk. Virtual memory allows a single program 
to expand its address space beyond the limits of main memory. More importantly, 
in recent computer systems virtual memory supports sharing of the main mem- 
ory among multiple, simultaneously active processes, which together require far 
more total physical main memory than exists. To support sharing, virtual mem- 
ory also provides mechanisms for memory protection. 

Managing the memory hierarchy between main memory and disk is challeng- 
ing because of the high cost of page faults. Several techniques are used to reduce 
the miss rate: 

1. Blocks, called pages, are made large to take advantage of spatial locality and 
to reduce the miss rate. 

2. The mapping between virtual addresses and physical addresses, which is 
implemented with a page table, is made fully associative so that a virtual 
page can be placed anywhere in main memory. 

3. The operating system uses techniques, such as LRU and a reference bit, to 
choose which pages to replace. 

Writes to disk are expensive, so virtual memory uses a write-back scheme and also 
tracks whether a page is unchanged (using a dirty bit) to avoid writing unchanged 
pages back to disk. 

The virtual memory mechanism provides address translation from a virtual 
address used by the program to the physical address space used for accessing 
memory. This address translation allows protected sharing of the main memory 
and provides several additional benefits, such as simplifying memory allocation. 
Ensuring that processes are protected from each other requires that only the 
operating system can change the address translations, which is implemented by 
preventing user programs from changing the page tables. Controlled sharing of 
pages among processes can be implemented with the help of the operating sys- 
tem and access bits in the page table that indicate whether the user program has 
read or write access to a page. 

If a processor had to access a page table resident in memory to translate every 
access, virtual memory would have too much overhead and caches would be 
pointless! Instead, a TLB acts as a cache for translations from the page table. 
Addresses are then translated from virtual to physical using the translations in 
the TLB. 

Caches, virtual memory, and TLBs all rely on a common set of principles and 
policies. The next section discusses this common framework. 



Although virtual memory was invented to enable a small memory to act as a 
large one, the performance difference between disk and memory means that if 
a program routinely accesses more virtual memory than it has physical mem- 
ory it will run very slowly. Such a program would be continuously swapping 
pages between memory and disk, called thrashing. Thrashing is a disaster if it 
occurs, but it is rare. If your program thrashes, the easiest solution is to run it 
on a computer with more memory or buy more memory for your computer. A 
more complex choice is to reexamine your algorithm and data structures to 
see if you can change the locality and thereby reduce the number of pages that 
your program uses simultaneously. This set of pages is informally called the 
working set. 

A more common performance problem is TLB misses. Since a TLB might handle 
only 32-64 page entries at a time, a program could easily see a high TLB miss 
rate, as the processor may access no more than a quarter megabyte directly: 64 × 4 KB 
= 0.25 MB. For example, TLB misses are often a challenge for Radix Sort. To try to 
alleviate this problem, most computer architectures now support variable page 
sizes. For example, in addition to the standard 4 KB page, MIPS hardware sup- 
ports 16 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB, 64 MB, and 256 MB pages. 
Hence, if a program uses large page sizes, it can access more memory directly 
without TLB misses. 

The practical challenge is getting the operating system to allow programs to 
select these larger page sizes. Once again, the more complex solution to reducing 
TLB misses is to reexamine the algorithm and data structures to reduce the work- 
ing set of pages. 
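
The arithmetic behind this advice is simple enough to tabulate. The sketch below computes how much memory a TLB can map at once (sometimes called its reach) for the page sizes listed above; the 64-entry TLB is the example from the text, and the function and constant names are assumptions.

```python
KB = 1024
MB = 1024 * KB

# Page sizes the text lists for MIPS hardware.
MIPS_PAGE_SIZES = [4 * KB, 16 * KB, 64 * KB, 256 * KB,
                   1 * MB, 4 * MB, 16 * MB, 64 * MB, 256 * MB]

def tlb_reach(entries, page_size_bytes):
    """Bytes of memory that can be accessed without a TLB miss."""
    return entries * page_size_bytes

for page_size in MIPS_PAGE_SIZES:
    reach_mb = tlb_reach(64, page_size) / MB
    print(f"{page_size // KB:>7} KB pages: {reach_mb:g} MB")
```

Each step up in page size multiplies the reach by four, so a program whose working set is a few hundred megabytes can fit within the TLB only if it is given the larger pages.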



Understanding Program Performance 
Check Yourself 



Match the memory hierarchy element on the left with the closest phrase on the right: 



1. L1 cache 

2. L2 cache 

3. Main memory 

4. TLB 



a. A cache for a cache 

b. A cache for disks 

c. A cache for a main memory 

d. A cache for page table entries 



7.5 A Common Framework for Memory Hierarchies 



By now, you've recognized that the different types of memory hierarchies share a 
great deal in common. Although many of the aspects of memory hierarchies differ 
quantitatively, many of the policies and features that determine how a hierarchy 
functions are similar qualitatively. Figure 7.29 shows how some of the quantitative 
characteristics of memory hierarchies can differ. In the rest of this section, we will 
discuss the common operational aspects of memory hierarchies and how these 
determine their behavior. We will examine these policies as a series of four ques- 
tions that apply between any two levels of a memory hierarchy, although for sim- 
plicity we will primarily use terminology for caches. 

Question 1: Where Can a Block Be Placed? 

We have seen that block placement in the upper level of the hierarchy can use a range 
of schemes, from direct mapped to set associative to fully associative. As mentioned 
above, this entire range of schemes can be thought of as variations on a set-associa- 
tive scheme where the number of sets and the number of blocks per set varies: 



Scheme name          Number of sets                               Blocks per set
Direct mapped        Number of blocks in cache                    1
Set associative      Number of blocks in cache / Associativity    Associativity (typically 2-16)
Fully associative    1                                            Number of blocks in the cache
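
The three rows of the table are one formula evaluated at different associativities, as the following sketch shows. The function name cache_geometry is an assumption, and the cache is assumed to hold a whole number of sets.

```python
def cache_geometry(num_blocks, associativity):
    """Return (number of sets, blocks per set) for a set-associative cache."""
    if associativity < 1 or num_blocks % associativity != 0:
        raise ValueError("associativity must divide the number of blocks")
    return num_blocks // associativity, associativity

direct_mapped = cache_geometry(256, 1)        # one block per set
four_way = cache_geometry(256, 4)             # typical set-associative point
fully_associative = cache_geometry(256, 256)  # a single set holds every block
```

Direct mapped and fully associative are simply the two ends of the range: associativity 1 and associativity equal to the number of blocks.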



The advantage of increasing the degree of associativity is that it usually 
decreases the miss rate. The improvement in miss rate comes from reducing 
misses that compete for the same location. We will examine these in more detail 
shortly. First, let's look at how much improvement is gained. Figure 7.30 shows 
the data for a workload consisting of the SPEC2000 benchmarks with caches of 4 

Feature                       Typical values    Typical values    Typical values for        Typical values
                              for L1 caches     for L2 caches     paged memory              for a TLB
Total size in blocks          250-2000          4000-250,000      16,000-250,000            16-512
Total size in kilobytes       16-64             500-8000          250,000-1,000,000,000     0.25-16
Block size in bytes           32-64             32-128            4000-64,000               4-32
Miss penalty in clocks        10-25             100-1000          10,000,000-100,000,000    10-1000
Miss rates (global for L2)    2%-5%             0.1%-2%           0.00001%-0.0001%          0.01%-2%

FIGURE 7.29 The key quantitative design parameters that characterize the major elements of memory hierarchy in a computer. 
These are typical values for these levels as of 2004. Although the range of values is wide, this is partially because many of the values that have 
shifted over time are related; for example, as caches become larger to overcome larger miss penalties, block sizes also grow. 



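[Figure 7.30 plot: miss rate (vertical axis, 0% to 15%) versus associativity (horizontal axis: one-way, two-way, four-way, eight-way), with one curve per cache size; the legible curve labels are 16 KB, 32 KB, 64 KB, and 128 KB.] 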



FIGURE 7.30 The data cache miss rates for each of eight cache sizes improve as the 
associativity increases. While the benefit of going from one-way (direct-mapped) to two-way set 
associative is significant, the benefits of further associativity are smaller (e.g., 1%-10% going from two-way 
to four-way versus 20%-30% improvement going from one-way to two-way). There is even less improve- 
ment in going from four-way to eight-way set associative, which, in turn, comes very close to the miss rates 
of a fully associative cache. Smaller caches obtain a significantly larger absolute benefit from associativity 
because the base miss rate of a small cache is larger. Figure 7.15 explains how this data was collected. 



KB to 512 KB, varying from direct mapped to eight-way set associative. The larg- 
est gains are obtained in going from direct mapped to two-way set associative, 
which yields between a 20% and 30% reduction in the miss rate. As cache sizes 
grow, the relative improvement from associativity increases only slightly; since the 
overall miss rate of a larger cache is lower, the opportunity for improving the miss 
rate decreases and the absolute improvement in the miss rate from associativity 
shrinks significantly. The potential disadvantages of associativity, as we mentioned 
earlier, are increased cost and slower access time. 

Question 2: How Is a Block Found? 

The choice of how we locate a block depends on the block placement scheme, 
since that dictates the number of possible locations. We can summarize the 
schemes as follows: 



Associativity      Location method                         Comparisons required
Direct mapped      index                                   1
Set associative    index the set, search among elements    degree of associativity
Full               search all cache entries                size of the cache
                   separate lookup table                   0






The choice among direct-mapped, set-associative, or fully associative mapping 
in any memory hierarchy will depend on the cost of a miss versus the cost of 
implementing associativity, both in time and in extra hardware. Including the L2 
cache on the chip enables much higher associativity, because the hit times are not 
as critical and the designer does not have to rely on standard SRAM chips as the 
building blocks. Fully associative caches are prohibitive except for small sizes, 
where the cost of the comparators is not overwhelming and where the absolute 
miss rate improvements are greatest. 
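
The first three rows of the summary table can be reproduced with one lookup routine. In this sketch a cache is a list of sets and each set is a list of stored tags; that representation, the function name, and the use of block addresses rather than byte addresses are assumptions. The comparison count models the number of comparators hardware would use in parallel, not sequential steps.

```python
def lookup(cache_sets, block_address):
    """Return (hit, comparisons) for a block address in a cache of sets."""
    num_sets = len(cache_sets)
    index = block_address % num_sets   # selects the only set that may hold it
    tag = block_address // num_sets    # identifies the block within that set
    ways = cache_sets[index]
    hit = any(stored == tag for stored in ways)
    return hit, len(ways)              # one comparison per block in the set

# Eight blocks organized three ways, each holding block address 12.
direct = [[None] for _ in range(8)]
direct[12 % 8][0] = 12 // 8
two_way = [[None, None] for _ in range(4)]
two_way[12 % 4][1] = 12 // 4
fully = [[None] * 8]
fully[0][5] = 12

results = [lookup(c, 12) for c in (direct, two_way, fully)]
```

The three organizations all find the block, but they need 1, 2, and 8 comparisons respectively, which is the cost that makes full associativity prohibitive for large caches.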

In virtual memory systems, a separate mapping table (the page table) is kept to 
index the memory. In addition to the storage required for the table, using an index 
table requires an extra memory access. The choice of full associativity for page 
placement and the extra table is motivated by four facts: 

1. Full associativity is beneficial, since misses are very expensive. 

2. Full associativity allows software to use sophisticated replacement schemes 
that are designed to reduce the miss rate. 

3. The full map can be easily indexed with no extra hardware and no search- 
ing required. 

4. The large page size means the page table size overhead is relatively small. 
(The use of a separate lookup table, like a page table for virtual memory, is 
not practical for a cache because the table would be much larger than a page 
table and could not be accessed quickly.) 

Therefore, virtual memory systems almost always use fully associative placement. 
Set-associative placement is often used for caches and TLBs, where the access 
combines indexing and the search of a small set. A few systems have used 
direct-mapped caches because of their advantage in access time and simplicity. The 
advantage in access time occurs because finding the requested block does not 
depend on a comparison. Such design choices depend on many details of the 
implementation, such as whether the cache is on-chip, the technology used for 
implementing the cache, and the critical role of cache access time in determining 
the processor cycle time. 

Question 3: Which Block Should Be Replaced 
on a Cache Miss? 

When a miss occurs in an associative cache, we must decide which block to 
replace. In a fully associative cache, all blocks are candidates for replacement. If 
the cache is set associative, we must choose among the blocks in the set. Of course, 
replacement is easy in a direct-mapped cache because there is only one candidate. 
We have already mentioned the two primary strategies for replacement in set- 
associative or fully associative caches: 

■ Random: Candidate blocks are randomly selected, possibly using some 
hardware assistance. For example, MIPS supports random replacement for 
TLB misses. 

■ Least recently used (LRU): The block replaced is the one that has been 
unused for the longest time. 

In practice, LRU is too costly to implement for hierarchies with more than a 
small degree of associativity (two to four, typically), since tracking the usage 
information is expensive. Even for four-way set associativity, LRU is often 
approximated, for example, by keeping track of which of the two pairs of blocks is 
LRU (which requires 1 bit), and then tracking which block in each pair is LRU 
(which requires 1 bit per pair). 
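
The pair-based approximation for a four-way set can be written in a few lines. This is a sketch of the scheme as described, using three bits per set; the class and method names are assumptions, and real hardware would keep these bits alongside the tags.

```python
class PseudoLRU4:
    """Approximate LRU for one four-way set using 3 bits."""

    def __init__(self):
        self.lru_pair = 0          # which pair (0 or 1) is less recently used
        self.lru_in_pair = [0, 0]  # per pair: which block (0 or 1) is LRU

    def touch(self, way):
        """Record an access to block `way` (0 to 3) of the set."""
        pair, block = divmod(way, 2)
        self.lru_pair = 1 - pair            # the other pair is now older
        self.lru_in_pair[pair] = 1 - block  # the sibling block is now older

    def victim(self):
        """Return the block to replace on a miss."""
        pair = self.lru_pair
        return 2 * pair + self.lru_in_pair[pair]


state = PseudoLRU4()
for way in (0, 1, 2, 3, 0):
    state.touch(way)
choice = state.victim()
```

After the access sequence 0, 1, 2, 3, 0 the true least recently used block is 1, but the sketch selects block 2, because block 1 shares a pair with the block that was just touched. The scheme saves state at the price of occasionally evicting a block that is not the oldest.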

For larger associativity, either LRU is approximated or random replacement is 
used. In caches, the replacement algorithm is in hardware, which means that the 
scheme should be easy to implement. Random replacement is simple to build in 
hardware, and for a two-way set-associative cache, random replacement has a 
miss rate about 1.1 times higher than LRU replacement. As the caches become 
larger, the miss rate for both replacement strategies falls, and the absolute differ- 
ence becomes small. In fact, random replacement can sometimes be better than 
the simple LRU approximations that are easily implemented in hardware. 

In virtual memory, some form of LRU is always approximated since even a tiny 
reduction in the miss rate can be important when the cost of a miss is enormous. 
Reference bits or equivalent functionality is often provided to make it easier for 
the operating system to track a set of less recently used pages. Because misses are 
so expensive and relatively infrequent, approximating this information primarily 
in software is acceptable. 

Question 4: What Happens on a Write? 

A key characteristic of any memory hierarchy is how it deals with writes. We have 
already seen the two basic options: 

■ Write-through: The information is written to both the block in the cache and 
to the block in the lower level of the memory hierarchy (main memory for a 
cache). The caches in Section 7.2 used this scheme. 

■ Write-back (also called copy-back): The information is written only to the 
block in the cache. The modified block is written to the lower level of the 
hierarchy only when it is replaced. Virtual memory systems always use 
write-back, for the reasons discussed in Section 7.4. 

Both write-back and write-through have their advantages. The key advantages 
of write-back are the following: 

■ Individual words can be written by the processor at the rate that the cache, 
rather than the memory, can accept them. 

■ Multiple writes within a block require only one write to the lower level in 
the hierarchy. 

■ When blocks are written back, the system can make effective use of a high- 
bandwidth transfer, since the entire block is written. 

Write-through has these advantages: 

■ Misses are simpler and cheaper because they never require a block to be 
written back to the lower level. 

■ Write-through is easier to implement than write-back, although to be prac- 
tical in a high-speed system, a write-through cache will need to use a write 
buffer. 

In virtual memory systems, only a write-back policy is practical because of the 
long latency of a write to the lower level of the hierarchy (disk). As processors con- 
tinue to increase in performance at a faster rate than DRAM-based main memory, 
the rate at which writes are generated by a processor will exceed the rate at which 
the memory system can process them, even allowing for physically and logically 
wider memories. Consequently, more and more caches are using a write-back 
strategy. 
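
The second advantage of write-back, that multiple writes within a block cost only one write to the lower level, can be checked with a deliberately tiny model. The sketch assumes a cache that holds a single block and counts a final flush at the end of the run; the function name and the address trace are invented for illustration.

```python
def lower_level_writes(write_addresses, block_size, policy):
    """Count writes sent to the lower level by a one-block cache."""
    if policy == "write-through":
        return len(write_addresses)  # every store also goes to memory
    if policy != "write-back":
        raise ValueError("policy must be 'write-through' or 'write-back'")
    writes = 0
    current_block = None
    dirty = False
    for address in write_addresses:
        block = address // block_size
        if block != current_block:
            if dirty:
                writes += 1          # write the replaced dirty block back
            current_block = block
        dirty = True
    if dirty:
        writes += 1                  # flush the last dirty block
    return writes

stores = [0, 4, 8, 12, 64, 68]       # four stores to one block, two to another
through = lower_level_writes(stores, 16, "write-through")
back = lower_level_writes(stores, 16, "write-back")
```

For this trace write-through sends six word writes to memory, while write-back sends two block writes. Whether that is a net gain depends on block size and on how many words in each block are actually modified.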
While caches, TLBs, and virtual memory may initially look very different, 
they rely on the same two principles of locality and can be understood by 
looking at how they deal with four questions: 

Question 1: Where can a block be placed? 
Answer:     One place (direct mapped), a few places (set associative), 
            or any place (fully associative). 

Question 2: How is a block found? 
Answer:     There are four methods: indexing (as in a direct-mapped 
            cache), limited search (as in a set-associative cache), full 
            search (as in a fully associative cache), and a separate 
            lookup table (as in a page table). 

Question 3: What block is replaced on a miss? 
Answer:     Typically, either the least recently used or a random block. 

Question 4: How are writes handled? 
Answer:     Each level in the hierarchy can use either write-through 
            or write-back. 



The Three Cs: An Intuitive Model for Understanding the 
Behavior of Memory Hierarchies 

In this section, we look at a model that provides insight into the sources of misses 
in a memory hierarchy and how the misses will be affected by changes in the hier- 
archy. We will explain the ideas in terms of caches, although the ideas carry over 
directly to any other level in the hierarchy. In this model, all misses are classified 
into one of three categories (the three Cs): 

■ Compulsory misses: These are cache misses caused by the first access to a 
block that has never been in the cache. These are also called cold-start misses. 

■ Capacity misses: These are cache misses caused when the cache cannot con- 
tain all the blocks needed during execution of a program. Capacity misses 
occur when blocks are replaced and then later retrieved. 

■ Conflict misses: These are cache misses that occur in set-associative or 
direct-mapped caches when multiple blocks compete for the same set. Con- 
flict misses are those misses in a direct-mapped or set-associative cache that 
are eliminated in a fully associative cache of the same size. These cache 
misses are also called collision misses. 
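
The three definitions translate directly into a classification procedure. The sketch below is one simple way to apply them, assuming LRU replacement and a trace of block addresses: a miss is compulsory if the block has never been referenced, capacity if a fully associative cache of the same size would also miss, and conflict otherwise. The function names are assumptions.

```python
from collections import OrderedDict

def lru_misses(trace, num_blocks, associativity):
    """Return the set of trace positions that miss in an LRU cache."""
    num_sets = num_blocks // associativity
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = set()
    for position, block in enumerate(trace):
        ways = sets[block % num_sets]
        if block in ways:
            ways.move_to_end(block)       # now the most recently used
        else:
            misses.add(position)
            if len(ways) == associativity:
                ways.popitem(last=False)  # evict the least recently used
            ways[block] = True
    return misses

def classify_misses(trace, num_blocks, associativity):
    """Count compulsory, capacity, and conflict misses for a block trace."""
    actual = lru_misses(trace, num_blocks, associativity)
    fully = lru_misses(trace, num_blocks, num_blocks)
    seen = set()
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for position, block in enumerate(trace):
        if position in actual:
            if block not in seen:
                counts["compulsory"] += 1
            elif position in fully:
                counts["capacity"] += 1
            else:
                counts["conflict"] += 1
        seen.add(block)
    return counts

# Blocks 0 and 4 collide in a four-block direct-mapped cache.
counts = classify_misses([0, 4, 0, 4], num_blocks=4, associativity=1)
```

For this trace the first two misses are compulsory and the last two are conflict misses, since a fully associative cache of four blocks would hold both blocks at once.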



The BIG Picture 



three Cs model A cache model 
in which all cache misses are 
classified into one of three cate- 
gories: compulsory misses, 
capacity misses, and conflict 
misses. 

compulsory miss Also called 
cold start miss. A cache miss 
caused by the first access to a 
block that has never been in the 
cache. 

capacity miss A cache miss 
that occurs because the cache, 
even with full associativity, cannot 
contain all the blocks needed 
to satisfy the request. 

conflict miss Also called collision 
miss. A cache miss that 
occurs in a set-associative or 
direct-mapped cache when multiple 
blocks compete for the 
same set and that is eliminated 
in a fully associative cache of the 
same size. 
FIGURE 7.31 The miss rate can be broken into three sources of misses. This graph shows 
the total miss rate and its components for a range of cache sizes. This data is for the SPEC2000 integer and 
floating-point benchmarks and is from the same source as the data in Figure 7.30. The compulsory miss 
component is 0.006% and cannot be seen in this graph. The next component is the capacity miss rate, 
which depends on cache size. The conflict portion, which depends both on associativity and on cache size, 
is shown for a range of associativities from one-way to eight-way. In each case, the labeled section 
corresponds to the increase in the miss rate that occurs when the associativity is changed from the next higher 
degree to the labeled degree of associativity. For example, the section labeled two-way indicates the 
additional misses arising when the cache has associativity of two rather than four. Thus, the difference in the 
miss rate incurred by a direct-mapped cache versus a fully associative cache of the same size is given by the 
sum of the sections marked eight-way, four-way, two-way, and one-way. The difference between eight-way 
and four-way is so small that it is difficult to see on this graph. 



Figure 7.31 shows how the miss rate divides into the three sources. These 
sources of misses can be directly attacked by changing some aspect of the cache 
design. Since conflict misses arise directly from contention for the same cache 
block, increasing associativity reduces conflict misses. Associativity, however, may 
slow access time, leading to lower overall performance. 

Capacity misses can easily be reduced by enlarging the cache; indeed, second- 
level caches have been growing steadily larger for many years. Of course, when we 
make the cache larger, we must also be careful about increasing the access time, 
which could lead to lower overall performance. Thus, first-level caches have been 
growing slowly if at all. 
| Design change | Effect on miss rate | Possible negative performance effect |
|---|---|---|
| Increase cache size | decreases capacity misses | may increase access time |
| Increase associativity | decreases miss rate due to conflict misses | may increase access time |
| Increase block size | decreases miss rate for a wide range of block sizes due to spatial locality | increases miss penalty; very large block could increase miss rate |

FIGURE 7.32 Memory hierarchy design challenges.

Because compulsory misses are generated by the first reference to a block, the 
primary way for the cache system to reduce the number of compulsory misses is 
to increase the block size. This will reduce the number of references required to 
touch each block of the program once because the program will consist of fewer 
cache blocks. Increasing the block size too much can have a negative effect on per- 
formance because of the increase in the miss penalty. 

The decomposition of misses into the three Cs is a useful qualitative model. In 
real cache designs, many of the design choices interact, and changing one cache 
characteristic will often affect several components of the miss rate. Despite such 
shortcomings, this model is a useful way to gain insight into the performance of 
cache designs. 
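
The three Cs classification can be made concrete with a small simulation. The sketch below is illustrative only and rests on several assumptions that are not in the text: a direct-mapped cache of `NUM_BLOCKS` blocks as the cache under study, references given as block addresses below `MAX_BLOCK_ADDR`, and LRU replacement for the fully associative comparison cache. All names are hypothetical. A miss is counted as compulsory if the block has never been referenced, as capacity if a fully associative cache of the same size would also miss, and as conflict otherwise.

```c
#include <stdbool.h>

#define NUM_BLOCKS 4         /* cache capacity in blocks (assumed for illustration) */
#define MAX_BLOCK_ADDR 1024  /* every block address in the trace must be below this */

struct miss_counts { int hits, compulsory, capacity, conflict; };

/* Fully associative LRU lookup. lru[0] is the most recently used block and
   *used is the number of valid entries. Returns true on a hit. */
bool fa_lru_access(unsigned lru[NUM_BLOCKS], int *used, unsigned block)
{
    int pos = -1;
    for (int i = 0; i < *used; i++)
        if (lru[i] == block) { pos = i; break; }
    bool hit = (pos >= 0);
    if (!hit) {
        if (*used < NUM_BLOCKS) (*used)++;
        pos = *used - 1;              /* new slot, or the least recently used one */
    }
    for (int i = pos; i > 0; i--)     /* shift more recent entries down by one */
        lru[i] = lru[i - 1];
    lru[0] = block;                   /* referenced block becomes most recent */
    return hit;
}

/* Classify each reference in a trace of block addresses for a direct-mapped
   cache with NUM_BLOCKS blocks. */
struct miss_counts classify_misses(const unsigned *trace, int n)
{
    struct miss_counts c = {0, 0, 0, 0};
    bool seen[MAX_BLOCK_ADDR] = {false};   /* models an infinite cache */
    unsigned dm_block[NUM_BLOCKS];         /* block held by each direct-mapped entry */
    bool dm_valid[NUM_BLOCKS] = {false};
    unsigned lru[NUM_BLOCKS];
    int used = 0;

    for (int t = 0; t < n; t++) {
        unsigned b = trace[t];
        unsigned idx = b % NUM_BLOCKS;
        bool dm_hit = dm_valid[idx] && dm_block[idx] == b;
        bool fa_hit = fa_lru_access(lru, &used, b);
        bool first_reference = !seen[b];

        seen[b] = true;                    /* update all three models every time */
        dm_valid[idx] = true;
        dm_block[idx] = b;

        if (dm_hit)               c.hits++;
        else if (first_reference) c.compulsory++;
        else if (!fa_hit)         c.capacity++;
        else                      c.conflict++;
    }
    return c;
}
```

With this sketch, a trace that alternates between blocks 0 and 4 produces two compulsory misses followed only by conflict misses, since both blocks map to entry 0 even though the cache is otherwise empty.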




The BIG Picture

The challenge in designing memory hierarchies is that every change that
potentially improves the miss rate can also negatively affect overall
performance, as Figure 7.32 summarizes. This combination of positive and
negative effects is what makes the design of a memory hierarchy interesting.



Check Yourself

Which of the following statements (if any) are generally true?

1. There is no way to reduce compulsory misses.

2. Fully associative caches have no conflict misses.

3. In reducing misses, associativity is more important than capacity.
7.6 Real Stuff: The Pentium P4 and the AMD Opteron Memory Hierarchies



In this section, we will look at the memory hierarchy in two modern microproces- 
sors: the Intel Pentium P4 and the AMD Opteron processor. In 2004, the P4 is 
used in a variety of PC desktops and small servers. The AMD Opteron processor is 
finding its way into higher-end servers and clusters. 

Figure 7.33 shows the Opteron die photo, and Figure 1.9 on page 21 in Chapter 
1 shows the P4 die photo. Both have secondary caches on the main processor die. 
Such integration reduces access time to the secondary cache and also reduces the
number of pins on the chip, since there is no need for a bus to an external
secondary cache.




FIGURE 7.33 An AMD Opteron processor die photo with the components labeled. The L2
cache occupies 42% of the die. The remaining components in order of size are HyperTransport™: 13%,
DDR memory: 10%, Fetch/Scan/Align/Microcode: 6%, Memory controller: 4%, FPU: 4%, Instruction
cache: 4%, Data cache: 4%, Execution units: 3%, Bus unit: 2%, and clock generator: 0.2%. In a 0.13-micron
technology, this die is 193 mm².
The Memory Hierarchies of the P4 and Opteron

Figure 7.34 summarizes the address sizes and TLBs of the two processors.

Note that the AMD Opteron has four TLBs while the P4 has two and that the
virtual and physical addresses do not have to match the word size. AMD implements
only 48 of the potential 64 bits of its virtual space and 40 of the potential 64
bits of its physical address space. Intel increases the physical address space to
36 bits, although no single program can address more than 32 bits.

Figure 7.35 shows their caches. Note that both the L1 data cache and the L2
caches are larger in the Opteron and that P4 uses a larger block size for its L2
cache than its L1 data cache.

Although the Opteron runs the same IA-32 programs as the Pentium P4, its 
biggest difference is that it has added a 64-bit addressing mode. Just as the 80386 
added a flat 32-bit address space and 32-bit registers to the prior 16-bit 80286 
architecture, Opteron adds a new mode with flat 64-bit address space and 64-bit 
registers to the IA-32 architecture, called AMD64. It increases the program 
counter to 64 bits, extends eight 32-bit registers to 64 bits, adds eight new 64-bit 
registers, and doubles the number of SSE2 registers. In 2004 Intel announced that 
future IA-32 processors will include their 64-bit address extension. 

| Characteristic | Intel Pentium P4 | AMD Opteron |
|---|---|---|
| Virtual address | 32 bits | 48 bits |
| Physical address | 36 bits | 40 bits |
| Page size | 4 KB, 2/4 MB | 4 KB, 2/4 MB |
| TLB organization | 1 TLB for instructions and 1 TLB for data; both are four-way set associative; both use pseudo-LRU replacement; both have 128 entries; TLB misses handled in hardware | 2 TLBs for instructions and 2 TLBs for data; both L1 TLBs fully associative, LRU replacement; both L2 TLBs are four-way set associative, round-robin LRU; both L1 TLBs have 40 entries; both L2 TLBs have 512 entries; TLB misses handled in hardware |

FIGURE 7.34 Address translation and TLB hardware for the Intel Pentium P4 and AMD
Opteron. The word size sets the maximum size of the virtual address, but a processor need not use all bits.
The physical address size is independent of word size. The P4 has one TLB for instructions and a separate
identical TLB for data, while the Opteron has both an L1 TLB and an L2 TLB for instructions and identical
L1 and L2 TLBs for data. Both processors provide support for large pages, which are used for things like the
operating system or mapping a frame buffer. The large-page scheme avoids using a large number of entries
to map a single object that is always present.

| Characteristic | Intel Pentium P4 | AMD Opteron |
|---|---|---|
| L1 cache organization | Split instruction and data caches | Split instruction and data caches |
| L1 cache size | 8 KB for data, 96 KB trace cache for RISC instructions (12K RISC operations) | 64 KB each for instructions/data |
| L1 cache associativity | 4-way set associative | 2-way set associative |
| L1 replacement | Approximated LRU replacement | LRU replacement |
| L1 block size | 64 bytes | 64 bytes |
| L1 write policy | Write-through | Write-back |
| L2 cache organization | Unified (instruction and data) | Unified (instruction and data) |
| L2 cache size | 512 KB | 1024 KB (1 MB) |
| L2 cache associativity | 8-way set associative | 16-way set associative |
| L2 replacement | Approximated LRU replacement | Approximated LRU replacement |
| L2 block size | 128 bytes | 64 bytes |
| L2 write policy | Write-back | Write-back |

FIGURE 7.35 First-level and second-level caches in the Intel Pentium P4 and AMD
Opteron. The primary caches in the P4 are physically indexed and tagged; for a discussion of the
alternative see the Elaboration on page 527.

Techniques to Reduce Miss Penalties

Both the Pentium 4 and the AMD Opteron have additional optimizations that
allow them to reduce the miss penalty. The first of these is the return of the
requested word first on a miss, as described in the Elaboration on page 490. Both
allow the processor to continue to execute instructions that access the data cache
during a cache miss. This technique, called a nonblocking cache, is commonly
used as designers attempt to hide the cache miss latency by using out-of-order
processors. They implement two flavors of nonblocking. Hit under miss allows
additional cache hits during a miss, while miss under miss allows multiple outstanding
cache misses. The aim of the first of these two is hiding some miss latency with
other work, while the aim of the second is overlapping the latency of two different
misses.

nonblocking cache A cache
that allows the processor to
make references to the cache
while the cache is handling an
earlier miss.

Overlapping a large fraction of miss times for multiple outstanding misses 
requires a high-bandwidth memory system capable of handling multiple misses in 
parallel. In desktop systems, the memory may only be able to take limited advan- 
tage of this capability, but large servers and multiprocessors often have memory 
systems capable of handling more than one outstanding miss in parallel. 

Both microprocessors prefetch instructions and have a built-in hardware 
prefetch mechanism for data accesses. They look at a pattern of data misses and use 
this information to try to predict the next address to start fetching the data before 
the miss occurs. Such techniques generally work best when accessing arrays in loops. 
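
To make the idea of predicting the next address from a pattern of misses concrete, here is a minimal sketch of a stride detector. It is not the mechanism used by the P4 or the Opteron, whose prefetchers are not described here; the structure, the function name `observe_miss`, and the rule of issuing a prediction only after two equal strides are all assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* State of a hypothetical single-stream stride detector. */
struct stride_detector {
    uint64_t last_addr;    /* address of the previous miss */
    int64_t  last_stride;  /* distance between the previous two misses */
    int      seen;         /* misses observed so far, saturating at 2 */
};

/* Record a miss address. Returns true and sets *predicted when the last two
   strides agree, that is, when three consecutive misses are equally spaced. */
bool observe_miss(struct stride_detector *d, uint64_t addr, uint64_t *predicted)
{
    bool confident = false;
    if (d->seen >= 1) {
        int64_t stride = (int64_t)(addr - d->last_addr);
        if (d->seen >= 2 && stride == d->last_stride && stride != 0) {
            *predicted = addr + (uint64_t)stride;  /* next address to prefetch */
            confident = true;
        }
        d->last_stride = stride;
    }
    d->last_addr = addr;
    if (d->seen < 2) d->seen++;
    return confident;
}
```

A loop that walks an array with a fixed element spacing produces exactly the equally spaced misses this detector recognizes, which is consistent with the observation that such techniques work best when accessing arrays in loops.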

A significant challenge facing cache designers is to support processors like the P4
and Opteron that can execute more than one memory instruction per clock cycle.
Multiple requests can be supported in the first-level cache by two different
techniques. The cache can be multiported, allowing more than one simultaneous access
to the same cache block. Multiported caches, however, are often too expensive, since
the RAM cells in a multiported memory must be much larger than single-ported
cells. The alternative scheme is to break the cache into banks and allow multiple,
independent accesses, provided the accesses are to different banks. The technique is
similar to interleaved main memory (see Figure 7.11 on page 489).
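
The banking rule can be sketched in a few lines. The example assumes that consecutive blocks are interleaved across banks, so the bank is selected by the low-order bits of the block address; the text does not say which bits real designs use, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bank holding the block that contains byte_addr, assuming consecutive
   blocks are assigned to consecutive banks. */
unsigned bank_of(uint32_t byte_addr, unsigned block_bytes, unsigned num_banks)
{
    return (byte_addr / block_bytes) % num_banks;
}

/* Two accesses can be serviced in the same cycle only if they go to
   different banks. */
bool can_issue_together(uint32_t a, uint32_t b,
                        unsigned block_bytes, unsigned num_banks)
{
    return bank_of(a, block_bytes, num_banks) != bank_of(b, block_bytes, num_banks);
}
```

Under these assumptions, with 64-byte blocks and four banks, byte addresses 0 and 64 fall in banks 0 and 1 and can proceed together, while addresses 0 and 256 both fall in bank 0 and must be serialized.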

To reduce the memory traffic in a multiprocessor configuration, Intel has other 
versions of the P4 with much larger on-chip caches in 2004. For example, the Intel 
Pentium P4 Xeon comes with third-level cache on chip of 1 MB and is intended for 
dual-processor servers. A more radical example is the Intel Pentium P4 Extreme 
Edition, which comes with 2 MB of L3 cache but no support for multiprocessing. 
These two chips are much larger and more expensive. For example, in 2004 a Preci- 
sion Workstation 360 with a 3.2 GHz P4 costs about $1900. Upgrading to the 
Extreme Edition processor adds $500 to the price. The Dell Precision Workstation 
450, which allows dual processors, costs about $2000 for a 3.2 GHz Xeon with 1 MB 
of L3 cache. Adding a second processor like that one adds $1500 to the price. 

The sophisticated memory hierarchies of these chips and the large fraction of 
the dies dedicated to caches and TLBs show the significant design effort expended 
to try to close the gap between processor cycle times and memory latency. Future 
advances in processor pipeline designs, together with the increased use of multi- 
processing that presents its own problems in memory hierarchies, provide many 
new challenges for designers. 

Elaboration: Perhaps the largest difference between the AMD and Intel chips is the 
use of a trace cache for the P4 instruction cache, while the AMD Opteron uses a more 
traditional instruction cache. 

Instead of organizing the instructions in a cache block sequentially to promote spa- 
tial locality, a trace cache finds a dynamic sequence of instructions including taken 
branches to load into a cache block. Thus, the cache blocks contain dynamic traces of 
the executed instructions as determined by the CPU rather than static sequences of 
instructions as determined by memory layout. It folds branch prediction (Chapter 6) into 
the cache, so the branches must be validated along with the addresses in order to have 
a valid fetch. In addition, the P4 caches the micro-operations (see Chapter 5) rather 
than the IA-32 instructions as in the Opteron. 

Clearly, trace caches have much more complicated address mapping mechanisms, 
since the addresses are no longer aligned to power-of-two multiples of the word size. 

Trace caches can improve utilization of cache blocks, however. For example, very
long blocks in conventional caches may be entered from a taken branch, and hence the
first portion of the block occupies space in the cache that might not be fetched.
Similarly, such blocks may be exited by taken branches, so the last portion of the block
might be wasted. Given that taken branches or jumps occur every 5-10 instructions,
effective block utilization is a real problem for processors like the Opteron, whose
64-byte block would likely include 16-24 80x86 instructions. Trace caches store
instructions only from the branch entry point to the exit of the trace, thereby avoiding such
header and trailer overhead. A downside of trace caches is that they potentially store
the same instructions multiple times in the cache: conditional branches making
different choices result in the same instructions being part of separate traces, which each
appear in the cache.

Taking into account both the larger size of the micro-operations and the redundancy
inherent in a trace cache, Intel claims that the miss rate of the 96 KB trace cache of the P4,
which holds 12K micro-operations, is about that of an 8 KB cache, which holds about
2K-3K IA-32 instructions.



7.7 Fallacies and Pitfalls



As one of the most naturally quantitative aspects of computer architecture, the
memory hierarchy would seem to be less vulnerable to fallacies and pitfalls. Yet
many fallacies have been propagated and pitfalls encountered, and some have led
to major negative outcomes. We start with a pitfall that often traps
students in exercises and exams.

Pitfall: Forgetting to account for byte addressing or the cache block size in simu- 
lating a cache. 

When simulating a cache (by hand or by computer), we need to make sure we 
account for the effect of byte addressing and multiword blocks in determining 
which cache block a given address maps into. For example, if we have a 32-byte 
direct-mapped cache with a block size of 4 bytes, the byte address 36 maps into 
block 1 of the cache, since byte address 36 is block address 9 and (9 modulo 8) = 1. 
On the other hand, if address 36 is a word address, then it maps into block (36 
mod 8) = 4. Make sure the problem clearly states the base of the address. 

In like fashion, we must account for the block size. Suppose we have a cache
with 256 bytes and a block size of 32 bytes. Which block does the byte address
300 fall into? If we break the address 300 into fields, we can see the answer.
Byte address 300 is block address

$$\left\lfloor \frac{300}{32} \right\rfloor = 9$$

The number of blocks in the cache is

$$\left\lfloor \frac{256}{32} \right\rfloor = 8$$

Block number 9 falls into cache block number (9 modulo 8) = 1.

This mistake catches many people, including the authors (in earlier drafts) and 
instructors who forget whether they intended the addresses to be in words, bytes, 
or block numbers. Remember this pitfall when you tackle the exercises. 
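
The two mappings in this pitfall can be written down directly. The sketch below assumes a direct-mapped cache whose size is a whole number of blocks; the function names are hypothetical and chosen only to keep the byte-address and word-address cases apart.

```c
/* Cache block that a byte address maps to in a direct-mapped cache. */
unsigned cache_block_of_byte_address(unsigned byte_addr,
                                     unsigned cache_bytes,
                                     unsigned block_bytes)
{
    unsigned block_addr = byte_addr / block_bytes;   /* drop the byte offset */
    unsigned num_blocks = cache_bytes / block_bytes; /* blocks in the cache */
    return block_addr % num_blocks;
}

/* The same mapping when the address is already a word address and the
   cache and block sizes are given in words. */
unsigned cache_block_of_word_address(unsigned word_addr,
                                     unsigned cache_words,
                                     unsigned block_words)
{
    return (word_addr / block_words) % (cache_words / block_words);
}
```

For the examples in the text, `cache_block_of_byte_address(36, 32, 4)` and `cache_block_of_byte_address(300, 256, 32)` both give block 1, while `cache_block_of_word_address(36, 8, 1)` gives block 4. Keeping the units in the function name is one way to avoid the confusion the pitfall describes.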

Pitfall: Ignoring memory system behavior when writing programs or when gener- 
ating code in a compiler. 

This could easily be written as a fallacy: "Programmers can ignore memory
hierarchies in writing code." We illustrate with an example using matrix multiply, to
complement the sort comparison in Figure 7.18 on page 508.

Here is the inner loop of the version of matrix multiply from Chapter 3: 

for (i = 0; i != 500; i = i + 1)
    for (j = 0; j != 500; j = j + 1)
        for (k = 0; k != 500; k = k + 1)
            x[i][j] = x[i][j] + y[i][k] * z[k][j];

When run with inputs that are 500 x 500 double precision matrices on a MIPS CPU
with a 1 MB secondary cache, the above loop ran at about half the speed of a
version in which the loop order is changed to k, j, i (so i is innermost)! The
only difference is how the program accesses memory and the
ensuing effect on the memory hierarchy. Further compiler optimizations using a
technique called blocking can result in a runtime that is another four times faster
for this code!
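
Blocking restructures the loops so that the computation works on submatrices small enough to stay in the cache. The following is a minimal hand-written sketch of that idea, not the transformation a particular compiler performs; the function names, the flat row-major array layout, and the tile size parameter `bsize` (which must be greater than zero) are assumptions made here. Both functions accumulate into `x`, so the caller is expected to zero it first.

```c
#include <stddef.h>

/* Unblocked i, j, k order for n x n row-major matrices. */
void matmul_naive(size_t n, double *x, const double *y, const double *z)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                x[i * n + j] += y[i * n + k] * z[k * n + j];
}

/* Blocked version: the two outer loops pick a bsize x bsize tile of z, and
   the inner loops reuse that tile for every row i before moving on. */
void matmul_blocked(size_t n, size_t bsize,
                    double *x, const double *y, const double *z)
{
    for (size_t jj = 0; jj < n; jj += bsize)
        for (size_t kk = 0; kk < n; kk += bsize) {
            size_t jmax = (jj + bsize < n) ? jj + bsize : n;  /* clip last tile */
            size_t kmax = (kk + bsize < n) ? kk + bsize : n;
            for (size_t i = 0; i < n; i++)
                for (size_t j = jj; j < jmax; j++) {
                    double r = 0.0;   /* partial dot product over this tile */
                    for (size_t k = kk; k < kmax; k++)
                        r += y[i * n + k] * z[k * n + j];
                    x[i * n + j] += r;
                }
        }
}
```

The tile size would normally be chosen so that a tile of `z` plus the rows of `x` and `y` being touched fit in the cache level of interest; the best value depends on the machine and has to be measured. The blocked version adds the products in a different order, so results can differ from the unblocked version in the last bits for general floating-point inputs.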

Pitfall: Using average memory access time to evaluate the memory hierarchy of an 
out-of-order processor. 

If a processor stalls during a cache miss, then you can separately calculate the 
memory-stall time and the processor execution time, and hence evaluate the 
memory hierarchy independently using average memory access time. 

If the processor continues to execute instructions and may even sustain more 
cache misses during a cache miss, then the only accurate assessment of the mem- 
ory hierarchy is to simulate the out-of-order processor along with the memory 
hierarchy. 
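
For reference, average memory access time is conventionally written as the hit time plus the miss rate times the miss penalty. The statement below uses that usual form; the term names are labels chosen here rather than taken from this section.

```latex
\text{Average memory access time} = \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty}
```

The formula charges the full miss penalty to every miss, which matches a processor that stalls until the miss completes. When part of the miss latency is overlapped with other instructions or with other misses, the sum overstates the time the processor actually loses, which is why it cannot be applied directly to an out-of-order processor.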
Pitfall: Extending an address space by adding segments on top of an unsegmented
address space.

During the 1970s, many programs grew so large that not all the code and data
could be addressed with just a 16-bit address. Computers were then revised to
offer 32-bit addresses, either through an unsegmented 32-bit address space (also
called a flat address space) or by adding 16 bits of segment to the existing 16-bit
address. From a marketing point of view, adding segments that were
programmer-visible and that forced the programmer and compiler to decompose
programs into segments could solve the addressing problem. Unfortunately, there is
trouble any time a programming language wants an address that is larger than one
segment, such as indices for large arrays, unrestricted pointers, or reference
parameters. Moreover, adding segments can turn every address into two words,
one for the segment number and one for the segment offset, causing problems in
the use of addresses in registers. Given the size of DRAMs and Moore's law, many
of today's 32-bit systems are facing similar problems.




7.8 Concluding Remarks




The difficulty of building a memory system to keep pace with faster processors is
underscored by the fact that the raw material for main memory, DRAMs, is
essentially the same in the fastest computers as it is in the slowest and cheapest. Figure
7.36 compares the memory hierarchy of microprocessors aimed at desktop, server,
and embedded applications. The L1 caches are similar across applications, with
the primary differences being L2 cache size, die size, processor clock rate, and
instructions issued per clock.

| MPU | AMD Opteron | Intrinsity FastMATH | Intel Pentium 4 | Intel PXA250 | Sun UltraSPARC IV |
|---|---|---|---|---|---|
| Instruction set architecture | IA-32, AMD64 | MIPS32 | IA-32 | ARM | SPARC v9 |
| Intended application | server | embedded | desktop | low-power embedded | server |
| Die size (mm²) (2004) | 193 | 122 | 217 | | 356 |
| Instructions issued/clock | 3 | 2 | 3 RISC ops | 1 | 4 x 2 |
| Clock rate (2004) | 2.0 GHz | 2.0 GHz | 3.2 GHz | 0.4 GHz | 1.2 GHz |
| Instruction cache | 64 KB, 2-way set associative | 16 KB, direct mapped | 12000 RISC op trace cache (~96 KB) | 32 KB, 32-way set associative | 32 KB, 4-way set associative |
| Latency (clocks) | 3? | 4 | 4 | 1 | 2 |
| Data cache | 64 KB, 2-way set associative | 16 KB, 1-way set associative | 8 KB, 4-way set associative | 32 KB, 32-way set associative | 64 KB, 4-way set associative |
| Latency (clocks) | 3 | 3 | 2 | 1 | 2 |
| TLB entries (I/D/L2 TLB) | 40/40/512/512 | | | | |

FIGURE 7.36 Desktop, embedded, and server microprocessors in 2004. From a memory hierarchy perspective, the primary difference
between categories is the L2 cache. There is no L2 cache for the low-power embedded, a large on-chip L2 for the embedded and desktop, and 16 MB
off chip for the server. The processor clock rates also vary: 0.4 GHz for low-power embedded, 1 GHz or higher for the rest. Note that UltraSPARC IV
has two processors on the chip.

It is the principle of locality that gives us a chance to overcome the long latency
of memory access, and the soundness of this strategy is demonstrated at all levels
of the memory hierarchy. Although these levels of the hierarchy look quite
different in quantitative terms, they follow similar strategies in their operation and
exploit the same properties of locality.

Because processor speeds continue to improve faster than either DRAM access
times or disk access times, memory will increasingly be the factor that limits
performance. Processors increase in performance at a high rate, and DRAMs are now
doubling their density about every two years. The access time of DRAMs, however,
is improving at a much slower rate, less than 10% per year. Figure 7.37 plots
processor performance against a 7% annual performance improvement in DRAM
latency. While latency improves slowly, recent enhancements in DRAM
technology (double data rate DRAMs and related techniques) have led to greater
increases in memory bandwidth. This potentially higher memory bandwidth has
enabled designers to increase cache block sizes with smaller increases in the miss
penalty.



Recent Trends 

The challenge in designing memory hierarchies to close this growing gap, as we 
noted in the Big Picture on page 545, is that all the hardware design choices for 
memory hierarchies have both a positive and negative effect on performance. This 
means that for each level of the hierarchy there is an optimal performance point 
per program, which must include some misses. If this is the case, how can we 
overcome the growing gap between processor speeds and lower levels of the hier- 
archy? This question is currently the topic of much research. 






[Figure 7.37 plot: Performance (logarithmic vertical axis, from 10 to 100,000) versus Year.]

FIGURE 7.37 Using their 1980 performance as a baseline, the access time of DRAMs versus the performance of processors
is plotted over time. Note that the vertical axis must be on a logarithmic scale to record the size of the processor-DRAM performance gap. The
memory baseline is 64 KB DRAM in 1980, with three years to the next generation until 1996 and two years thereafter, with a 7% per year performance
improvement in latency. The processor line assumes a 35% improvement per year until 1986, and a 55% improvement until 2003. It slows thereafter.
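
The growth rates quoted in the caption can be compounded to see how quickly the gap opens. The sketch below is only arithmetic on those rates, under assumptions made here: performance is 1.0 for both lines in 1980, the 35% rate applies to the six years 1980 through 1985 and the 55% rate from 1986 on, and the functions are meaningful only up to 2003 because the caption does not give the later rate. The function names are hypothetical.

```c
/* Processor performance relative to 1980, for 1980 <= year <= 2003. */
double processor_perf(int year)
{
    double perf = 1.0;
    for (int y = 1980; y < year; y++)
        perf *= (y < 1986) ? 1.35 : 1.55;  /* 35% per year, then 55% per year */
    return perf;
}

/* DRAM latency performance relative to 1980, improving 7% per year. */
double dram_perf(int year)
{
    double perf = 1.0;
    for (int y = 1980; y < year; y++)
        perf *= 1.07;
    return perf;
}

/* Ratio of processor performance to DRAM performance in a given year. */
double processor_dram_gap(int year)
{
    return processor_perf(year) / dram_perf(year);
}
```

Under these assumptions the ratio returned by `processor_dram_gap(2003)` is in the low thousands, which is why the figure needs a logarithmic vertical axis.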



On-chip first-level caches initially helped close the gap that was growing 
between processor clock cycle time and off-chip SRAM cycle time. To narrow the 
gap between the small on-chip caches and DRAM, second-level caches became 
widespread. Today, all desktop computers use second-level caches on chip, and 
third-level caches are becoming popular in some segments. Multilevel caches also 
make it possible to use other optimizations more easily for two reasons. First, the 
design parameters of a second- or third-level cache are different from a first-level
cache. For example, because a second- or third-level cache will be much larger, it
is possible to use larger block sizes. Second, a second- or third-level cache is not 
constantly being used by the processor, as a first-level cache is. This allows us to 
consider having the second- or third-level cache do something when it is idle that 
may be useful in preventing future misses. 

Another possible direction is to seek software help. Efficiently managing the 
memory hierarchy using a variety of program transformations and hardware 
facilities is a major focus of compiler enhancements. Two different ideas are being 
explored. One idea is to reorganize the program to enhance its spatial and tempo- 
ral locality. This approach focuses on loop-oriented programs that use large arrays 
as the major data structure; large linear algebra problems are a typical example. By 
restructuring the loops that access the arrays, substantially improved locality,
and therefore cache performance, can be obtained. The example on page 551
showed how effective even a simple change of loop structure could be.

Another direction is to try to use compiler-directed prefetching. In prefetch- 
ing, a block of data is brought into the cache before it is actually referenced. The 
compiler tries to identify data blocks needed in the future and, using special 
instructions, tells the memory hierarchy to move the blocks into the cache. When 
the block is actually referenced, it is found in the cache, rather than causing a 
cache miss. The use of secondary caches has made prefetching even more attrac- 
tive, since the secondary cache can be involved in a prefetch, while the primary 
cache continues to service processor requests. 
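
The effect of a prefetch instruction can be illustrated by hand, even though the text describes the compiler inserting such instructions. The sketch below uses `__builtin_prefetch`, a builtin provided by GCC and Clang, guarded so the code still compiles elsewhere; the function name and the prefetch distance of 16 elements are assumptions chosen for illustration, and a useful distance depends on the miss latency and the work done per element.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* elements ahead; an assumed value, tune per machine */

/* Sum an array while requesting data a fixed distance ahead of its use. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DISTANCE < n)
            /* Arguments: address, 0 = prefetch for read, 1 = low temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
#endif
        sum += a[i];
    }
    return sum;
}
```

The prefetch is only a hint: it does not change the result, and on a simple sequential scan like this one a hardware prefetcher may already hide most of the latency.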

As we will see in Chapter 9, memory systems are also a central design issue
for parallel processors. The growing importance of the memory hierarchy in 
determining system performance in both uniprocessor and multiprocessor sys- 
tems means that this important area will continue to be a focus of both designers 
and researchers for some years to come. 



prefetching A technique in 
which data blocks needed in the 
future are brought into the 
cache early by the use of special 
instructions that specify the 
address of the block. 
Historical Perspective and Further Reading

This history section gives an overview of memory technologies, from mercury
delay lines to DRAM, the invention of the memory hierarchy and protection
mechanisms, and concludes with a brief history of operating systems, including
CTSS, MULTICS, UNIX, BSD UNIX, MS-DOS, Windows, and Linux.
7.1 [5] <§7.1> SRAM is commonly used to implement small, fast, on-chip caches
while DRAM is used for larger, slower main memory. In the past, a common design
for supercomputers was to build machines with no caches and main memories
made entirely out of SRAM (the Cray C90, for example, a very fast computer
in its day). If cost were no object, would you still want to design a system this way?

7.2 [10] <§7.2> Describe the general characteristics of a program that would
exhibit very little temporal and spatial locality with regard to data accesses. Provide
an example program (pseudocode is fine).

7.3 [10] <§7.2> Describe the general characteristics of a program that would
exhibit very high amounts of temporal locality but very little spatial locality with
regard to data accesses. Provide an example program (pseudocode is fine).

7.4 [10] <§7.2> Describe the general characteristics of a program that would
exhibit very little temporal locality but very high amounts of spatial locality with
regard to data accesses. Provide an example program (pseudocode is fine).

7.5 [3/3] <§7.2> A new processor can use either a write-through or write-back 
cache selectable through software. 

a. Assume the processor will run data intensive applications with a large num- 
ber of load and store operations. Explain which cache write policy should be 
used. 

b. Consider the same question but this time for a safety critical system in 
which data integrity is more important than memory performance. 

7.6 [10] <§7.2> For More Practice: Locality.

7.7 [10] <§7.2> For More Practice: Locality.

7.8 [10] <§7.2> For More Practice: Locality.

7.9 [10] <§7.2> Here is a series of address references given as word addresses: 2,
3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6, and 11. Assuming a direct-mapped
cache with 16 one-word blocks that is initially empty, label each reference in the
list as a hit or a miss and show the final contents of the cache.

7.10 [10] <§7.2> Using the series of references given in Exercise 7.9, show the hits 
and misses and final cache contents for a direct-mapped cache with four-word 
blocks and a total size of 16 words. 

7.11 [15] <§7.2> Given the following pseudocode:

int array[10000, 100000];

for each element array[i][j] {
    array[i][j] = array[i][j]*2;
}

write two C programs that implement this algorithm: one should access the
elements of the array in row-major order, and the other should access them in
column-major order. Compare the execution times of the two programs. What does
this tell you about the effects of memory layout on cache performance?

7.12 [10] <§7.2> Compute the total number of bits required to implement the
cache in Figure 7.9 on page 486. This number is different from the size of the
cache, which usually refers to the number of bytes of data stored in the cache. The
number of bits needed to implement the cache represents the total amount of
memory needed for storing all the data, tags, and valid bits.

7.13 [10] <§7.2> Find a method to eliminate the AND gate on the valid bit in
Figure 7.7 on page 478. (Hint: You need to change the comparison.)

7.14 [10] <§7.2> Consider a memory hierarchy using one of the three
organizations for main memory shown in Figure 7.11 on page 489. Assume that the cache
block size is 16 words, that the width of organization (b) of the figure is four words,
and that the number of banks in organization (c) is four. If the main memory latency
for a new access is 10 memory bus clock cycles and the transfer time is 1 memory bus
clock cycle, what are the miss penalties for each of these organizations?

7.15 [10] <§7.2> For More Practice: Cache Performance.

7.16 [15] <§7.2> Cache C1 is direct-mapped with 16 one-word blocks. Cache C2
is direct-mapped with 4 four-word blocks. Assume that the miss penalty for C1 is
8 memory bus clock cycles and the miss penalty for C2 is 11 memory bus clock
cycles. Assuming that the caches are initially empty, find a reference string for which
C2 has a lower miss rate but spends more memory bus clock cycles on cache misses
than C1. Use word addresses.

7.17 [5] <§7.2> In More Depth: Average Memory Access Time

7.18 [5] <§7.2> In More Depth: Average Memory Access Time

7.19 [10] <§7.2> In More Depth: Average Memory Access Time

7.20 [10] <§7.2> Assume a memory system that supports interleaving either four
reads or four writes. Given the following memory addresses in order as they
appear on the memory bus: 3, 9, 17, 2, 51, 37, 13, 4, 8, 41, 67, 10, which ones will result
in a bank conflict?

7.21 (3 hours] <§7.3> Use a cache simulator to simulate several different cache 
organizations for the first 1 million references in a trace of gcc. Both dinero (a 
cache simulator) and the gcc traces are available — see the Preface of this book for 
information on how to obtain them. Assume an instruction cache of 32 KB and a 
data cache of 32 KB using the same organization. You should choose at least two 
kinds of associativity and two block sizes. Draw a diagram like that in Figure 7.17 
on page 503 that shows the data cache organization with the best hit rate. 

7.22 [1 day] <§7.3> You are commissioned to design a cache for a new system. It
has a 32-bit physical byte address and requires separate instruction and data caches.
The SRAMs have an access time of 1.5 ns and a size of 32K x 8 bits, and you have
a total of 16 SRAMs to use. The miss penalty for the memory system is 8 + 2 x Block
size in words. Using set associativity adds 0.2 ns to the cache access time. Using the
first 1 million references of gcc, find the best I and D cache organizations, given the
available SRAMs.

7.23 [10] <§§7.2, B.5> For More Practice: Cache Configurations

7.24 [10] <§§7.2, B.5> For More Practice: Cache Configurations

7.25 [10] <§7.3> For More Practice: Cache Operation

7.26 [10] <§7.3> For More Practice: Cache Operation

7.27 [10] <§7.3> For More Practice: Cache Operation

7.28 [5] <§7.3> Associativity usually improves the miss ratio, but not always. 
Give a short series of address references for which a two-way set-associative cache 
with LRU replacement would experience more misses than a direct-mapped cache 
of the same size. 

7.29 [15] <§7.3> Suppose a computer's address size is k bits (using byte addressing),
the cache size is S bytes, the block size is B bytes, and the cache is A-way set-associative.
Assume that B is a power of two, so B = 2^b. Figure out what the following
quantities are in terms of S, B, A, b, and k: the number of sets in the cache, the
number of index bits in the address, and the number of bits needed to implement
the cache (see Exercise 7.12).

7.30 [10] <§7.3> For More Practice: Cache Configurations.

7.31 [10] <§7.3> For More Practice: Cache Configurations.

7.32 [20] <§7.3> Consider three processors with different cache configurations: 

■ Cache 1: Direct-mapped with one-word blocks

■ Cache 2: Direct-mapped with four-word blocks

■ Cache 3: Two-way set associative with four-word blocks

The following miss rate measurements have been made:

■ Cache 1: Instruction miss rate is 4%; data miss rate is 6%. 

■ Cache 2: Instruction miss rate is 2%; data miss rate is 4%. 

■ Cache 3: Instruction miss rate is 2%; data miss rate is 3%. 

For these processors, one-half of the instructions contain a data reference. Assume 
that the cache miss penalty is 6 + Block size in words. The CPI for this workload 
was measured on a processor with cache 1 and was found to be 2.0. Determine 
which processor spends the most cycles on cache misses. 

7.33 [5] <§7.3> The cycle times for the processors in Exercise 7.32 are 420 ps for
the first and second processors and 310 ps for the third processor. Determine 
which processor is the fastest and which is the slowest. 

7.34 [15] <§7.3> Assume that the cache for the system described in Exercise 7.32 is
two-way set associative and has eight-word blocks and a total size of 16 KB. Show the 
cache organization and access using the same format as Figure 7.17 on page 503. 

7.35 [10] <§§7.2, 7.4> The following C program is run (with no optimizations)
on a processor with a cache that has eight-word (32-byte) blocks and holds 256 
bytes of data: 

int i, j, c, stride, array[512];
...
for (i=0; i<10000; i++)
    for (j=0; j<512; j=j+stride)
        c = array[j]+17;

If we consider only the cache activity generated by references to the array and we 
assume that integers are words, what is the expected miss rate when the cache is 
direct mapped and stride = 256? How about if stride = 255? Would either of these 
change if the cache were two-way set associative? 

7.36 [10] <§§7.3, B.5> For More Practice: Cache Configurations

7.37 [5] <§§7.2-7.4> For More Practice: Memory Hierarchy Interactions

7.38 [4 hours] <§§7.2-7.4> We want to use a cache simulator to simulate several
different TLB and virtual memory organizations. Use the first 1 million references
of gcc for this evaluation. We want to know the TLB miss rate for each of the following
TLBs and page sizes:

1. 64-entry TLB with full associativity and 4 KB pages 

2. 32-entry TLB with full associativity and 8 KB pages 

3. 64-entry TLB with eight-way associativity and 4 KB pages 

4. 128-entry TLB with four-way associativity and 4 KB pages

7.39 [15] <§7.4> Consider a virtual memory system with the following properties:

■ 40-bit virtual byte address 

■ 16 KB pages 

■ 36-bit physical byte address 

What is the total size of the page table for each process on this processor, assuming 
that the valid, protection, dirty, and use bits take a total of 4 bits and that all the vir- 
tual pages are in use? (Assume that disk addresses are not stored in the page table.) 

7.40 [15] <§7.4> Assume that the virtual memory system of Exercise 7.39 is implemented
with a two-way set-associative TLB with a total of 256 TLB entries.
Show the virtual-to-physical mapping with a figure like Figure 7.24 on page 525.
Make sure to label the width of all fields and signals. 

7.41 [10] <§7.4> A processor has a 16-entry TLB and uses 4 KB pages. What are
the performance consequences of this memory system if a program accesses at least 
2 MB of memory at a time? Can anything be done to improve performance? 

7.42 [10] <§7.4> Buffer overflows are a common exploit used to gain control of
a system. If a buffer is allocated on the stack, a hacker could overflow the buffer and
insert a sequence of malicious instructions, compromising the system. Can you
think of a hardware mechanism that could be used to prevent this?

7.43 [15] <§7.4> For More Practice: Hierarchical Page Tables

7.44 [15] <§7.4> For More Practice: Hierarchical Page Tables

7.45 [5] <§7.5> If all misses are classified into one of three categories — compul- 
sory, capacity, or conflict (as discussed on page 543) — which misses are likely to be 
reduced when a program is rewritten so as to require less memory? How about if 
the clock rate of the processor that the program is running on is increased? How 
about if the associativity of the existing cache is increased? 

7.46 [5] <§7.5> The following C program could be used to help construct a cache 
simulator. Many of the data types have not been defined, but the code accurately 
describes the actions that take place during a read access to a direct-mapped cache. 

word ReadDirectMappedCache(address a)
{
    static Entry cache[CACHE_SIZE_IN_WORDS];
    Entry e = cache[a.index];
    if (e.valid == FALSE || e.tag != a.tag) {
        e.valid = TRUE;
        e.tag = a.tag;
        e.data = load_from_memory(a);
    }
    return e.data;
}

Your task is to modify this code to produce an accurate description of the actions 
that take place during a read access to a direct-mapped cache with multiple-word 
blocks. 

7.47 [8] <§7.5> This exercise is similar to Exercise 7.46, except this time write the 
code for read accesses to an n-way set-associative cache with one-word blocks.
Note that your code will likely suggest that the comparisons are sequential in na- 
ture when in fact they would be performed in parallel by actual hardware. 






7.48 [8] <§7.5> Extend your solution to Exercise 7.46 by including the specification
of a new procedure for handling write accesses, assuming a write-through policy.
Be sure to consider whether or not your solution for handling read accesses
needs to be modified.

7.49 [8] <§7.5> Extend your solution to Exercise 7.46 by including the specification
of a new procedure for handling write accesses, assuming a write-back policy.
Be sure to consider whether or not your solution for handling read accesses needs
to be modified.

7.50 [8] <§7.5> This exercise is similar to Exercise 7.48, but this time extend your
solution to Exercise 7.47. Assume that the cache uses random replacement. 

7.51 [8] <§7.5> This exercise is similar to Exercise 7.49, but this time extend your
solution to Exercise 7.47. Assume that the cache uses random replacement. 

7.52 [5] <§§7.7-7.8> Why might a compiler perform the following optimization?

/* Before */
for (j = 0; j < 20; j++)
    for (i = 0; i < 200; i++)
        x[i][j] = x[i][j] + 1;

/* After */
for (i = 0; i < 200; i++)
    for (j = 0; j < 20; j++)
        x[i][j] = x[i][j] + 1;



Answers To Check Yourself

§7.1, page 472: 1.

§7.2, page 491: 1 and 4: A lower miss penalty can lead to smaller blocks, yet higher
memory bandwidth usually leads to larger blocks, since the miss penalty is only
slightly larger.

§7.3, page 510: 1.

§7.4, page 538: 1-a, 2-c, 3-c, 4-d.

§7.5, page 545: 2.


Computers in the Real World

Saving the World's Art Treasures


Problem: Find a way to help conserve art- 
work threatened by environmental factors and 
aging or damaged by earlier attempts at resto- 
ration without causing further harm to irre- 
placeable artworks. 

Solution: Use computers and scientific 
instrumentation to analyze the artwork and its 
setting, enabling art conservators and restor- 
ers to undertake a more informed and success- 
ful preservation of an artwork. 

Art conservation and restoration have devel- 
oped into high-technology fields that make 
extensive use of computing and scientific 
instrumentation. For example, one of the most 
challenging forms of art to restore and main- 
tain are frescoes, which are painted in the wet 
plaster of a wall or ceiling. Moisture and heat 
change the surface and cause deterioration; 
similarly air pollution, smoke from candles, 
and other contaminants directly attack the 
paint as well as add dirt and grime that cover 
the original artwork. 

During the restoration of Michelangelo's 
frescoes in the Sistine Chapel, computers were 
used to survey the ceiling, finding cracks and 
precisely mapping the surface and the frescoes.



Since the width of the ceiling and walls varies 
from about three feet to almost six feet, there 
are significant differences in thermal behavior, 
which in turn affects the surface painting. 
Computers were used to model the entire 
structure, including the high humidity gener- 
ated when a thousand people stand inside the 
chapel on a warm day! This led to a computer- 
controlled climate system that uses sensors 
placed in strategic locations. The goal is to 
keep the visitors cool while preserving Miche- 
langelo's masterpiece for generations to come. 




A laser scan of Michelangelo's statue of David 



Perhaps the area of art conservation that 
has been most affected by the availability of 
low-cost, high-performance computation has 
been painting restoration. Three techniques — 
infrared reflectography, ultraviolet imaging, 
and X-radiography — have found the heaviest 
use. Because of the need for highly precise, 
high resolution imaging, computer controlled 
cameras or X-ray scanners are used in all these 
techniques. This results in a patchwork of 
images, which are then stitched together by a 
computer. The combination of computer con- 
trolled motion of a camera or X-ray scanner 
and subsequent computer composition of tens 
to thousands of images permits the scanning
of large surfaces at very high resolution. 

Infrared reflectography uses light in the 
near-infrared spectrum and a digital camera 
to detect the intensity of reflection of the light 
from the surface of a painting, mural, or 
fresco. This technique is useful for finding the 
underdrawing that most artists use to initially 
sketch out the forms in a painting. The under- 
drawing, typically done in black, often using 
charcoal, absorbs the infrared light. In the
figure below are two images of a painting: one
shown in normal light (on the left) and one 
using the infrared reflectography technique 
(on the right). 

Restorers use ultraviolet imaging to look at 
the original colors of a painting that has been 
retouched. X-radiography provides similar 
information, since white and yellow pigments 
that were covered or painted over appear 
darker due to their lead content. 

Scanning technologies have also been 
applied to three-dimensional art objects, such 
as sculpture. Michelangelo's David was 
scanned using a laser range finder by a group 
led by Professor Marc Levoy at Stanford. The 
resulting database for a scan with 0.29 mm 
resolution consists of over 2 billion polygons 
and 32 gigabytes of data. The Digital Miche- 
langelo project has created a detailed model of 
the famous sculpture useful both for conser- 
vation as well as an educational tool for stu- 
dents around the world. Two of the many 
images that can be derived from the three-dimensional
scan are shown opposite.

To learn more, see these references in the library:



Conserving paintings, a site dedicated to Harvard Uni- 
versity's digital imaging lab. 

Sistine Chapel, a short background on the Sistine 
Chapel. 

The Digital Michelangelo project 



An image from the Sistine Chapel in normal light 
(left) and in infrared (right).

8.1 Introduction



Although users can get frustrated if their computer hangs and must be rebooted, 
they become apoplectic if their storage system crashes and they lose information. 
Thus, the bar for dependability is much higher for storage than for computation. 
Networks also plan for failures in communication, including several mechanisms 
to detect and recover from such failures. Hence, I/O systems generally place much 
greater emphasis on dependability and cost, while processors and memory focus 
on performance and cost. 

I/O systems must also plan for expandability and for diversity of devices, which 
is not a concern for processors. Expandability is related to storage capacity, which 
is another design parameter for I/O systems; systems may need a lower bound of 
storage capacity to fulfill their role. 

Although performance plays a smaller role for I/O, it is more complex. For 
example, with some devices we may care primarily about access latency, while 
with others throughput is crucial. Furthermore, performance depends on many 
aspects of the system: the device characteristics, the connection between the 
device and the rest of the system, the memory hierarchy, and the operating sys- 
tem. Figure 8.1 shows the structure of a simple system with its I/O. All of the com- 
ponents, from the individual I/O devices to the processor to the system software, 
will affect the dependability, expandability, and performance of tasks that include 
I/O. 

I/O devices are incredibly diverse. Three characteristics are useful in organizing 
this wide variety: 

■ Behavior: Input (read once), output (write only, cannot be read), or storage 
(can be reread and usually rewritten). 

■ Partner: Either a human or a machine is at the other end of the I/O device, 
either feeding data on input or reading data on output. 

■ Data rate: The peak rate at which data can be transferred between the I/O 
device and the main memory or processor. It is useful to know what maxi- 
mum demand the device may generate. 

For example, a keyboard is an input device used by a human with a peak data rate 
of about 10 bytes per second. Figure 8.2 shows some of the I/O devices connected 
to computers. 

In Chapter 1, we briefly discussed four important and characteristic I/O 
devices: mice, graphics displays, disks, and networks. In this chapter we go into 
much more depth on disk storage and networks. 

FIGURE 8.1 A typical collection of I/O devices. The connections between the I/O devices, processor,
and memory are usually called buses. Communication among the devices and the processor uses both
interrupts and protocols on the bus, as we will see in this chapter. Figure 8.11 on page 585 shows the
organization for a desktop PC.



How we should assess I/O performance often depends on the application. In 
some environments, we may care primarily about system throughput. In these 
cases, I/O bandwidth will be most important. Even I/O bandwidth can be mea- 
sured in two different ways: 

1. How much data can we move through the system in a certain time? 

2. How many I/O operations can we do per unit of time? 

Which performance measurement is best may depend on the environment. For 
example, in many multimedia applications, most I/O requests are for long streams 
of data, and transfer bandwidth is the important characteristic. In another 
environment, we may wish to process a large number of small, unrelated accesses 
to an I/O device. An example of such an environment might be a tax-processing 
office of the National Income Tax Service (NITS). NITS mostly cares about pro- 
cessing a large number of forms in a given time; each tax form is stored separately 
and is fairly small. A system oriented toward large file transfer may be satisfactory, 
but an I/O system that can support the simultaneous transfer of many small files 
may be cheaper and faster for processing millions of tax forms. 

| Device | Behavior | Partner | Data rate (Mbit/sec) |
|---|---|---|---|
| Keyboard | input | human | 0.0001 |
| Mouse | input | human | 0.0038 |
| Voice input | input | human | 0.2640 |
| Sound input | input | machine | 3.0000 |
| Scanner | input | human | 3.2000 |
| Voice output | output | human | 0.2640 |
| Sound output | output | human | 8.0000 |
| Laser printer | output | human | 3.2000 |
| Graphics display | output | human | 800.0000-8000.0000 |
| Modem | input or output | machine | 0.0160-0.0640 |
| Network/LAN | input or output | machine | 100.0000-1000.0000 |
| Network/wireless LAN | input or output | machine | 11.0000-54.0000 |
| Optical disk | storage | machine | 80.0000 |
| Magnetic tape | storage | machine | 32.0000 |
| Magnetic disk | storage | machine | 240.0000-2560.0000 |



FIGURE 8.2 The diversity of I/O devices. I/O devices can be distinguished by whether they serve as 
input, output, or storage devices; their communication partner (people or other computers); and their peak 
communication rates. The data rates span eight orders of magnitude. Note that a network can be an input 
or an output device, but cannot be used for storage. Transfer rates for devices are always quoted in base 10, 
so that 10 Mbit/sec = 10,000,000 bits/sec. 



I/O requests: Reads or writes to I/O devices.



In other applications, we care primarily about response time, which you will 
recall is the total elapsed time to accomplish a particular task. If the I/O requests 
are extremely large, response time will depend heavily on bandwidth, but in many 
environments most accesses will be small, and the I/O system with the lowest 
latency per access will deliver the best response time. On single-user machines 
such as desktop computers and laptops, response time is the key performance 
characteristic. 

A large number of applications, especially in the vast commercial market for 
computing, require both high throughput and short response times. Examples 
include automatic teller machines (ATMs), order entry and inventory tracking 
systems, file servers, and Web servers. In such environments, we care about both 
how long each task takes and how many tasks we can process in a second. The 
number of ATM requests you can process per hour doesn't matter if each one 
takes 15 minutes — you won't have any customers left! Similarly, if you can process 
each ATM request quickly but can only handle a small number of requests at once, 
you won't be able to support many ATMs, or the cost of the computer per ATM 
will be very high. 

In summary, the three classes of desktop, server, and embedded computers are 
sensitive to I/O dependability and cost. Desktop and embedded systems are more 
focused on response time and diversity of I/O devices, while server systems are
more focused on throughput and expandability of I/O devices. 



8.2 Disk Storage and Dependability



As mentioned in Chapter 1, magnetic disks rely on a rotating platter coated with a 
magnetic surface and use a moveable read/write head to access the disk. Disk stor- 
age is nonvolatile — the data remains even when power is removed. A magnetic 
disk consists of a collection of platters (1-4), each of which has two recordable 
disk surfaces. The stack of platters is rotated at 5400 to 15,000 RPM and has a 
diameter from an inch to just over 3.5 inches. Each disk surface is divided into 
concentric circles, called tracks. There are typically 10,000 to 50,000 tracks per 
surface. Each track is in turn divided into sectors that contain the information; 
each track may have 100 to 500 sectors. Sectors are typically 512 bytes in size, 
although there is an initiative to increase the sector size to 4096 bytes. The 
sequence recorded on the magnetic media is a sector number, a gap, the information
for that sector including error correction code (see Appendix B, page B-64),
a gap, the sector number of the next sector, and so on. Originally, all tracks
had the same number of sectors and hence the same number of bits, but with the 
introduction of zone bit recording (ZBR) in the early 1990s, disk drives changed 
to a varying number of sectors (and hence bits) per track, instead keeping the 
spacing between bits constant. ZBR increases the number of bits on the outer 
tracks and thus increases the drive capacity. 

As we saw in Chapter 1, to read and write information the read/write heads 
must be moved so that they are over the correct location. The disk heads for each 
surface are connected together and move in conjunction, so that every head is 
over the same track of every surface. The term cylinder is used to refer to all the 
tracks under the heads at a given point on all surfaces. 

To access data, the operating system must direct the disk through a three-stage 
process. The first step is to position the head over the proper track. This operation 
is called a seek, and the time to move the head to the desired track is called the 
seek time. 

Disk manufacturers report minimum seek time, maximum seek time, and 
average seek time in their manuals. The first two are easy to measure, but the aver- 
age is open to wide interpretation because it depends on the seek distance. The 
industry has decided to calculate average seek time as the sum of the time for all 
possible seeks divided by the number of possible seeks. Average seek times are 
usually advertised as 3 ms to 14 ms, but, depending on the application and scheduling
of disk requests, the actual average seek time may be only 25% to 33% of the
advertised number because of locality of disk references. This locality arises both
because of successive accesses to the same file and because the operating system
tries to schedule such accesses together.

nonvolatile: Storage device where data retains its value even when power is removed.

track: One of thousands of concentric circles that makes up the surface of a magnetic disk.

sector: One of the segments that make up a track on a magnetic disk; a sector is the
smallest amount of information that is read or written on a disk.

seek: The process of positioning a read/write head over the proper track on a disk.

rotational latency: Also called rotational delay. The time required for the desired sector
of a disk to rotate under the read/write head; usually assumed to be half the rotation time.

Once the head has reached the correct track, we must wait for the desired sec- 
tor to rotate under the read/write head. This time is called the rotational latency 
or rotational delay. The average latency to the desired information is halfway 
around the disk. Because the disks rotate at 5400 RPM to 15,000 RPM, the average 
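rotational latency is between

$$\text{Average rotational latency} = \frac{0.5\ \text{rotation}}{5400\ \text{RPM}} = \frac{0.5\ \text{rotation}}{5400\ \text{RPM} / \left(60\ \frac{\text{seconds}}{\text{minute}}\right)} = 0.0056\ \text{seconds} = 5.6\ \text{ms}$$

and

$$\text{Average rotational latency} = \frac{0.5\ \text{rotation}}{15{,}000\ \text{RPM}} = \frac{0.5\ \text{rotation}}{15{,}000\ \text{RPM} / \left(60\ \frac{\text{seconds}}{\text{minute}}\right)} = 0.0020\ \text{seconds} = 2.0\ \text{ms}$$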
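
The same calculation can be written as a small C function. This is only a sketch: the function name avg_rotational_latency_ms is an assumption made for illustration and does not come from the text. It converts the rotation speed from rotations per minute to rotations per second and returns the time for half a rotation.

```c
/* Average rotational latency in milliseconds.
   rpm is the rotation speed in rotations per minute.
   (Illustrative helper; the name is not from the text.) */
double avg_rotational_latency_ms(double rpm)
{
    double rotations_per_second = rpm / 60.0;      /* RPM -> rotations/sec */
    double seconds = 0.5 / rotations_per_second;   /* time for half a rotation */
    return seconds * 1000.0;                       /* seconds -> milliseconds */
}
```

Calling the function with 5400.0 and 15000.0 gives values that round to the 5.6 ms and 2.0 ms computed above.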



The last component of a disk access, transfer time, is the time to transfer a block 
of bits. The transfer time is a function of the sector size, the rotation speed, and 
the recording density of a track. Transfer rates in 2004 are between 30 and 80 
MB/sec. The one complication is that most disk controllers have a built-in cache 
that stores sectors as they are passed over; transfer rates from the cache are typi- 
cally higher and may be up to 320 MB/sec in 2004. Today, most disk transfers are 
multiple sectors in length. 

A disk controller usually handles the detailed control of the disk and the transfer 
between the disk and the memory. The controller adds the final component of 
disk access time, controller time, which is the overhead the controller imposes in 
performing an I/O access. The average time to perform an I/O operation will con- 
sist of these four times plus any wait time incurred because other processes are 
using the disk. 



EXAMPLE 



Disk Read Time 

What is the average time to read or write a 512-byte sector for a typical disk 
rotating at 10,000 RPM? The advertised average seek time is 6 ms, the transfer 
rate is 50 MB/sec, and the controller overhead is 0.2 ms. Assume that the disk 
is idle so that there is no waiting time. 

Average disk access time is equal to Average seek time + Average rotational
delay + Transfer time + Controller overhead. Using the advertised average
seek time, the answer is

$$6.0\ \text{ms} + \frac{0.5\ \text{rotation}}{10{,}000\ \text{RPM}} + \frac{0.5\ \text{KB}}{50\ \text{MB/sec}} + 0.2\ \text{ms} = 6.0 + 3.0 + 0.01 + 0.2 = 9.2\ \text{ms}$$

If the measured average seek time is 25% of the advertised average time, the 
answer is 

1.5 ms + 3.0 ms + 0.01 ms + 0.2 ms = 4.7 ms 

Notice that when we consider measured average seek time, as opposed to 
advertised average seek time, the rotational latency can be the largest compo- 
nent of the access time. 
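
The four-term sum in this example can be sketched in C as follows. The function and parameter names are assumptions made for illustration, not part of the text. As in the example, the disk is assumed idle (no waiting time), and 1 MB is treated as 1000 KB so that kilobytes divided by MB/sec yields milliseconds directly.

```c
/* Average time in milliseconds to access one block on an idle disk:
   seek time + rotational delay + transfer time + controller overhead.
   (Hypothetical helper; names are illustrative, not from the text.) */
double disk_access_time_ms(double seek_ms, double rpm,
                           double block_kb, double transfer_mb_per_sec,
                           double controller_ms)
{
    /* Half a rotation, converted from seconds to milliseconds. */
    double rotation_ms = 0.5 / (rpm / 60.0) * 1000.0;
    /* Assumes 1 MB = 1000 KB, so KB / (MB/sec) is already in ms. */
    double transfer_ms = block_kb / transfer_mb_per_sec;
    return seek_ms + rotation_ms + transfer_ms + controller_ms;
}
```

With the example's parameters, disk_access_time_ms(6.0, 10000.0, 0.5, 50.0, 0.2) corresponds to the advertised-seek case, and replacing the first argument with 1.5 corresponds to the measured-seek case; the results round to the 9.2 ms and 4.7 ms shown above.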



Disk densities have continued to increase for more than 50 years. The impact 
of this compounded improvement in density and the reduction in physical size of 
a disk drive has been amazing, as Figure 8.3 shows. The aims of different disk 
designers have led to a wide variety of drives being available at any particular time. 
Figure 8.4 shows the characteristics of three magnetic disks. In 2004, these disks 
from a single manufacturer cost between $0.50 and $5 per gigabyte, depending on 
size, interface, and performance. The smaller drive has advantages in power and 
volume per byte. 



Elaboration: Most disk controllers include caches. Such caches allow for fast 
access to data that was recently read between transfers requested by the CPU. They 
use write through and do not update on a write miss. They often also include prefetch 
algorithms to try to anticipate demand. Of course, such capabilities complicate the 
measurement of disk performance and increase the importance of workload choice. 



Dependability, Reliability, and Availability 

Users crave dependable storage, but how do you define it? In the computer indus- 
try, it is harder than looking it up in the dictionary. After considerable debate, the 
following is considered the standard definition (Laprie 1985): 

Computer system dependability is the quality of delivered service such that reli- 
ance can justifiably be placed on this service. The service delivered by a system 
is its observed actual behavior as perceived by other system(s) interacting with 
this system's users. Each module also has an ideal specified behavior, where a
service specification is an agreed description of the expected behavior. A system 
failure occurs when the actual behavior deviates from the specified behavior. 

FIGURE 8.3 Six magnetic disks, varying in diameter from 14 inches down to 1.8 inches.
The IBM microdrive, not shown, has a 1-inch diameter. The pictured disks were introduced more than
15 years ago and hence are not intended to be representative of the best capacity of modern disks of these
diameters. This photograph does, however, accurately portray their relative physical sizes. The widest disk is
the DEC R81, containing four 14-inch diameter platters and storing 456 MB. It was manufactured in 1985.
The 8-inch diameter disk comes from Fujitsu, and this 1984 disk stores 130 MB on six platters. The Micropolis
RD53 has five 5.25-inch platters and stores 85 MB. The IBM 0361 also has five platters, but these are just
3.5 inches in diameter. This 1988 disk holds 320 MB. In 2004, the most dense 3.5-inch disk had 2 platters
and held 200 GB in the same space, yielding an increase in density of about 600 times! The Conner CP
2045 has two 2.5-inch platters containing 40 MB and was made in 1990. The smallest disk in this photograph
is the Integral 1820. This single 1.8-inch platter contains 20 MB and was made in 1992. Figure 8.11
on page 585 shows a 10-inch drive that holds 340 MB.



Thus, you need a reference specification of expected behavior to be able to 
determine dependability. Users can then see a system alternating between two 
states of delivered service with respect to the service specification: 

1. Service accomplishment, where the service is delivered as specified

2. Service interruption, where the delivered service is different from the speci- 
fied service 

Transitions from state 1 to state 2 are caused by failures, and transitions from state 
2 to state 1 are called restorations. Failures can be permanent or intermittent. The 
latter is the more difficult case to diagnose when a system oscillates between the 
two states; permanent failures are much easier to diagnose. This definition leads 
to two related terms: reliability and availability. 

| Characteristics | Seagate ST373453 | Seagate ST3200822 | Seagate ST94811A |
|---|---|---|---|
| Disk diameter (inches) | 3.50 | 3.50 | 2.50 |
| Formatted data capacity (GB) | 73.4 | 200.0 | 40.0 |
| Number of disk surfaces (heads) | 8 | 4 | 2 |
| Rotation speed (RPM) | 15,000 | 7200 | 5400 |
| Internal disk cache size (MB) | 8 | 8 | 8 |
| External interface, bandwidth (MB/sec) | Ultra320 SCSI, 320 | Serial ATA, 150 | ATA, 100 |
| Sustained transfer rate (MB/sec) | 57-86 | 32-58 | 34 |
| Minimum seek (read/write) (ms) | 0.2/0.4 | 1.0/1.2 | 1.5/2.0 |
| Average seek (read/write) (ms) | 3.6/3.9 | 8.5/9.5 | 12.0/14.0 |
| Mean time to failure (MTTF) (hours) | 1,200,000 @ 25°C | 600,000 @ 25°C | 330,000 @ 25°C |
| Warranty (years) | 5 | 3 | |
| Nonrecoverable read errors per bits read | < 1 per 10^15 | < 1 per 10^14 | < 1 per 10^14 |
| Temperature, vibration limits (operating) | 5-55°C, 400 Hz @ 0.5 G | 0-60°C, 350 Hz @ 0.5 G | 5-55°C, 400 Hz @ 1 G |
| Size: dimensions (in.), weight (pounds) | 1.0" x 4.0" x 5.8", 1.9 lbs | 1.0" x 4.0" x 5.8", 1.4 lbs | 0.4" x 2.7" x 3.9", 0.2 lbs |
| Power: operating/idle/standby (watts) | 20?/12/- | 12/8/1 | 2.4/1.0/0.4 |
| GB/cu. in., GB/watt | 3 GB/cu.in., 4 GB/W | 9 GB/cu.in., 16 GB/W | 10 GB/cu.in., 17 GB/W |
| Price in 2004, $/GB | ~ $400, ~ $5/GB | ~ $100, ~ $0.5/GB | ~ $100, ~ $2.50/GB |



FIGURE 8.4 Characteristics of three magnetic disks by a single manufacturer in 2004. The disks shown here either interface to
SCSI, a standard I/O bus for many systems, or ATA, a standard I/O bus for PCs. The first disk is intended for file servers, the second for desktop PCs,
and the last for laptop computers. Each disk has an 8 MB cache. The transfer rate from the cache is 3-6 times faster than the transfer rate from the disk
surface. The much lower cost of the ATA 3.5-inch drive is primarily due to the hypercompetitive PC market, although there are differences in
performance and reliability between it and the SCSI drive. The service life for these disks is 5 years, although Seagate offers a 5-year guarantee only on the
SCSI drive, with a 1-year guarantee on the other two. Note that the quoted MTTF assumes nominal power and temperature. Disk lifetimes can be
much shorter if temperature and vibration are not controlled. See the link to Seagate at www.seagate.com for more information on these drives.



Reliability is a measure of the continuous service accomplishment — or, equiva- 
lently, of the time to failure — from a reference point. Hence, the mean time to fail- 
ure (MTTF) of disks in Figure 8.4 is a reliability measure. Service interruption is 
measured as mean time to repair (MTTR). Mean time between failures (MTBF) is 
simply the sum of MTTF + MTTR. Although MTBF is widely used, MTTF is 
often the more appropriate term. 

Availability is a measure of the service accomplishment with respect to the 
alternation between the two states of accomplishment and interruption. Availabil- 
ity is statistically quantified as 

$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$$

Note that reliability and availability are quantifiable measures, rather than just 
synonyms for dependability. 
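
As a quick numerical illustration of these definitions, the sketch below computes availability and MTBF for the three disks of Figure 8.4. The MTTF values are the ones quoted in the figure; the 24-hour MTTR and the function names are assumptions made only for this example.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time service is accomplished: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures is the sum MTTF + MTTR."""
    return mttf_hours + mttr_hours


# MTTF values quoted in Figure 8.4; the repair time is an assumed value.
ASSUMED_MTTR_HOURS = 24.0
DISK_MTTF_HOURS = {
    "ST373453": 1_200_000,
    "ST3200822": 600_000,
    "ST94811A": 330_000,
}

for name, mttf in DISK_MTTF_HOURS.items():
    a = availability(mttf, ASSUMED_MTTR_HOURS)
    print(f"{name}: MTBF = {mtbf(mttf, ASSUMED_MTTR_HOURS):,.0f} h, availability = {a:.6f}")
```

Because MTTF is so much larger than any plausible repair time for a single disk, the availability of one disk is very close to 1; the measure becomes more interesting for whole systems built from many components.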

What is the cause of failures? Figure 8.5 summarizes many papers that have col- 
lected data on reasons for computer systems and telecommunications systems to 
fail. Clearly, human operators are a significant source of failures. 



small computer systems 
interface (SCSI) A bus used as 
a standard for I/O devices. 

| Operator | Software | Hardware | System | Year data collected |
|---|---|---|---|---|
| 42% | 25% | 18% | Data center (Tandem) | 1985 |
| 15% | 55% | 14% | Data center (Tandem) | 1989 |
| 18% | 44% | 39% | Data center (DEC VAX) | 1985 |
| 50% | 20% | 30% | Data center (DEC VAX) | 1993 |
| 50% | 14% | 19% | U.S. public telephone network | 1996 |
| 54% | 7% | 30% | U.S. public telephone network | 2000 |
| 60% | 25% | 15% | Internet services | 2002 |


FIGURE 8.5 Summary of studies of reasons for failures. Although it is difficult to collect data
to determine if operators are the cause of errors, since operators often record the reasons for failures, these
studies did capture that data. There were often other categories, such as environmental reasons for outages,
but they were generally small. The top two rows come from a classic paper by Jim Gray [1990], which is still
widely quoted almost 20 years after the data was collected. The next two rows are from a paper by Murphy
and Gent, who studied causes of outages in VAX systems over time ("Measuring system and software
reliability using an automated data collection process," Quality and Reliability Engineering International 11:5,
September-October 1995, 341-53). The fifth and sixth rows are studies of FCC failure data about the U.S.
public switched telephone network by Kuhn ("Sources of failure in the public switched telephone network,"
IEEE Computer 30:4, April 1997, 31-36) and by Patty Enriquez. The most recent study of three Internet
services is from Oppenheimer, Ganapathi, and Patterson [2003].

To increase MTTF, you can improve the quality of the components or design 
systems to continue operation in the presence of components that have failed. 
Hence, failure needs to be defined with respect to a context. A failure in a compo- 
nent may not lead to a failure of the system. To make this distinction clear, the 
term fault is used to mean failure of a component. Here are three ways to improve 
MTTF: 

1. Fault avoidance: preventing fault occurrence by construction 

2. Fault tolerance: using redundancy to allow the service to comply with the 
service specification despite faults occurring, which applies primarily to 
hardware faults 

3. Fault forecasting: predicting the presence and creation of faults, which 
applies to hardware and software faults 

Shrinking MTTR can help availability as much as increasing MTTF. For example, 
tools for fault detection, diagnosis, and repair can help reduce the time to repair 
faults by people, software, and hardware. 
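
A short worked example makes the point concrete. Dividing the numerator and denominator of the availability formula by MTTF shows that only the ratio MTTR/MTTF matters; the numbers used afterward are assumed purely for illustration.

$$\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} = \frac{1}{1 + \text{MTTR}/\text{MTTF}}$$

Suppose MTTF = 1000 hours and MTTR = 10 hours, giving an availability of 1000/1010, or about 0.9901. Halving MTTR to 5 hours gives 1000/1005, or about 0.9950. Doubling MTTF to 2000 hours gives 2000/2010, which is the same 0.9950, since both changes halve the ratio MTTR/MTTF.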



redundant arrays of inexpensive disks (RAID) An
organization of disks that uses an array of small and
inexpensive disks so as to increase both
performance and reliability.



RAID 

Leveraging redundancy to improve the availability of disk storage is captured in 
the phrase Redundant Arrays of Inexpensive Disks, abbreviated RAID. At the 
time the term was coined, the alternative was large, expensive disks, such as the 
larger ones in Figure 8.3. The argument was that by replacing a few large disks 
with many small disks, performance would improve because there would be more
read heads, and there would be advantages in cost, power, and floor space since 
smaller disks are much more efficient per gigabyte than larger disks. Redundancy 
was needed because the many more smaller disks had lower reliability than a few 
large disks. 

By having many small disks, the cost of extra redundancy to improve depend- 
ability is small relative to the large disks. Thus, dependability was more affordable 
if you constructed a redundant array of inexpensive disks. In retrospect, this was 
the key advantage. 

How much redundancy do you need? Do you need extra information to find 
the faults? Does it matter how you organize the data and the extra check informa- 
tion on these disks? The paper that coined the term gave an evolutionary answer 
to these questions, starting with the simplest but most expensive solution. Figure 
8.6 shows the evolution and example cost in number of extra check disks. To keep 
track of the evolution, the authors numbered the stages of RAID, and they are still 
used today. 



No Redundancy (RAID 0) 

Simply spreading data over multiple disks, called striping, automatically forces 
accesses to several disks. Striping across a set of disks makes the collection appear 
to software as a single large disk, which simplifies storage management. It also 
improves performance for large accesses, since many disks can operate at once. 
Video-editing systems, for example, often stripe their data and may not worry 
about dependability as much as, say, databases. 
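
To make striping concrete, here is a minimal sketch of one common mapping, in which logically sequential blocks are dealt round-robin across the disks. The function names and the one-block stripe unit are assumptions for illustration; real arrays often stripe in larger units.

```python
def stripe_location(logical_block: int, n_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk)."""
    if n_disks < 1 or logical_block < 0:
        raise ValueError("need at least one disk and a non-negative block number")
    offset, disk = divmod(logical_block, n_disks)
    return disk, offset


def disks_touched(first_block: int, n_blocks: int, n_disks: int) -> set[int]:
    """Set of disks that an access of n_blocks sequential blocks keeps busy."""
    return {
        stripe_location(block, n_disks)[0]
        for block in range(first_block, first_block + n_blocks)
    }


# A large access spreads over every disk; a one-block access uses only one.
print(disks_touched(first_block=2, n_blocks=4, n_disks=4))
print(disks_touched(first_block=3, n_blocks=1, n_disks=4))
```

Under this mapping, any access of at least as many sequential blocks as there are disks keeps every disk busy, which is where the performance gain for large accesses comes from.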

RAID 0 is something of a misnomer as there is no redundancy. However, RAID
levels are often left to the operator to set when creating a storage system, and
RAID 0 is often listed as one of the options. Hence, the term RAID 0 has become
widely used.



striping Allocation of logically 
sequential blocks to separate 
disks to allow higher perfor- 
mance than a single disk can 
deliver. 



Mirroring (RAID 1) 

This traditional scheme for tolerating disk failure, called mirroring or shadowing, 
uses twice as many disks as does RAID 0. Whenever data are written to one disk, 
those data are also written to a redundant disk, so that there are always two copies 
of the information. If a disk fails, the system just goes to the "mirror" and reads its 
contents to get the desired information. Mirroring is the most expensive RAID 
solution, since it requires the most disks. 



mirroring Writing the identi- 
cal data to multiple disks to 
increase data availability. 



Error Detecting and Correcting Code (RAID 2) 

RAID 2 borrows an error detection and correction scheme most often used for
memories (see Appendix B). Since RAID 2 has fallen into disuse, we'll not
describe it here.

| RAID level | Organization | Example users |
|---|---|---|
| RAID 0 | No redundancy | Widely used |
| RAID 1 | Mirroring | EMC, HP (Tandem), IBM |
| RAID 2 | Error correction code | Unused |
| RAID 3 | Bit-interleaved parity | Storage Concepts |
| RAID 4 | Block-interleaved parity | Network Appliance |
| RAID 5 | Distributed block-interleaved parity | Widely used |
| RAID 6 | P + Q redundancy | Rarely used |

[The "Data disks" and "Check disks" columns of the figure are drawings and are not reproduced here.]


FIGURE 8.6 RAID for an example of four data disks showing extra check disks per RAID 
level and companies that use each level. Figures 8.7 and 8.8 explain the difference between RAID 
3, RAID 4, and RAID 5. 



protection group The group 
of data disks or blocks that share 
a common check disk or block. 



Bit-Interleaved Parity (RAID 3) 

The cost of higher availability can be reduced to 1/N, where N is the number of 
disks in a protection group. Rather than have a complete copy of the original data 
for each disk, we need only add enough redundant information to restore the lost 
information on a failure. Reads or writes go to all disks in the group, with one 
extra disk to hold the check information in case there is a failure. RAID 3 is popu- 
lar in applications with large data sets, such as multimedia and some scientific 
codes. 

Parity is one such scheme. Readers unfamiliar with parity can think of the 
redundant disk as having the sum of all the data in the other disks. When a disk 
fails, then you subtract all the data in the good disks from the parity disk; the 
remaining information must be the missing information. Parity is simply the sum 
modulo two. 
Unlike RAID 1, many disks must be read to determine the missing data. The
assumption behind this technique is that taking longer to recover from failure but 
spending less on redundant storage is a good trade-off. 
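
The parity idea can be sketched in a few lines. Since parity is the sum modulo two, both "adding" and "subtracting" are the bitwise exclusive OR. The block contents and function names below are made up for illustration.

```python
from functools import reduce


def parity(blocks: list[bytes]) -> bytes:
    """Bytewise XOR (sum modulo two) of equally sized blocks."""
    if not blocks or len({len(b) for b in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct(surviving_blocks: list[bytes], parity_block: bytes) -> bytes:
    """Recover the one missing block from the survivors and the parity block."""
    return parity(surviving_blocks + [parity_block])


# Four data disks in one protection group, plus one check disk.
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa", b"\x00\xff"]
check = parity(data)

# Disk 2 fails: every surviving disk must be read to rebuild its contents.
survivors = data[:2] + data[3:]
assert reconstruct(survivors, check) == data[2]
```

Note that recovery reads all the surviving disks in the group, which is the longer recovery time that the scheme trades for less redundant storage.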

Block-Interleaved Parity (RAID 4) 

RAID 4 uses the same ratio of data disks and check disks as RAID 3, but they 
access data differently. The parity is stored as blocks and associated with a set of 
data blocks. 

In RAID 3, every access went to all disks. However, some applications prefer 
smaller accesses, allowing independent accesses to occur in parallel. That is the 
purpose of the RAID levels 4 to 6. Since error detection information in each sector 
is checked on reads to see if data are correct, such "small reads" to each disk can 
occur independently as long as the minimum access is one sector. In the RAID 
context, a small access goes to just one disk in a protection group while a large 
access goes to all the disks in a protection group. 

Writes are another matter. It would seem that each small write would demand 
that all other disks be accessed to read the rest of the information needed to 
recalculate the new parity, as in Figure 8.7. A "small write" would require reading 
the old data and old parity, adding the new information, and then writing the new 
parity to the parity disk and the new data to the data disk. 



[Figure 8.7 diagram. Left (RAID 3): new data, three reads (steps 1-3), two writes (steps 4-5). Right (RAID 4): new data, two reads (steps 1-2), two writes (steps 3-4).]



FIGURE 8.7 Small write update on RAID 3 versus RAID 4. This optimization for small writes reduces the number of disk accesses as well as
the number of disks occupied. This figure assumes we have four blocks of data and one block of parity. The straightforward RAID 3 parity calculation
in the left of the figure reads blocks D1, D2, and D3 before adding block D0' to calculate the new parity P'. (In case you were wondering, the new data
D0' comes directly from the CPU, so disks are not involved in reading it.) The RAID 4 shortcut on the right reads the old value D0 and compares it to
the new value D0' to see which bits will change. You then read the old parity P and change the corresponding bits to form P'. The logical function
exclusive OR does exactly what we want. This example replaces three disk reads (D1, D2, D3) and two disk writes (D0', P') involving all the disks with
two disk reads (D0, P) and two disk writes (D0', P'), which involve just two disks. Increasing the size of the parity group increases the savings of the
shortcut. RAID 5 uses the same shortcut.






The key insight to reduce this overhead is that parity is simply a sum of infor- 
mation; by watching which bits change when we write the new information, we 
need only change the corresponding bits on the parity disk. Figure 8.7 shows the 
shortcut. We must read the old data from the disk being written, compare old data 
to the new data to see which bits change, read the old parity, change the corre- 
sponding bits, then write the new data and new parity. Thus, the small write 
involves four disk accesses to two disks instead of accessing all disks. This organi- 
zation is RAID 4. 
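
The shortcut can be written as P' = P xor (D0 xor D0'). The sketch below, with made-up one-byte blocks and hypothetical function names, shows that the shortcut yields the same parity as a full recalculation while touching only the old data and the old parity.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise exclusive OR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def small_write_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """New parity from two reads (old data, old parity) instead of all disks."""
    changed_bits = xor_blocks(old_data, new_data)   # which bits flip in the data
    return xor_blocks(old_parity, changed_bits)     # flip the same bits in parity


# Four data blocks D0..D3 and their parity (0x0f ^ 0x33 ^ 0x55 ^ 0x00 = 0x69).
d = [b"\x0f", b"\x33", b"\x55", b"\x00"]
old_parity = b"\x69"

new_d0 = b"\xa5"
new_parity = small_write_parity(d[0], new_d0, old_parity)

# Full recalculation over D0', D1, D2, D3 gives the same answer.
full = new_d0
for block in d[1:]:
    full = xor_blocks(full, block)
assert new_parity == full
```

The four disk accesses are the two reads that supply `old_data` and `old_parity` and the two writes of `new_data` and the returned parity.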

Distributed Block-Interleaved Parity (RAID 5) 

RAID 4 efficiently supports a mixture of large reads, large writes, and small reads, 
plus it allows small writes. One drawback to the system is that the parity disk must 
be updated on every write, so the parity disk is the bottleneck for back-to-back 
writes. 

To fix the parity-write bottleneck, the parity information can be spread 
throughout all the disks so that there is no single bottleneck for writes. The dis- 
tributed parity organization is RAID 5. 

Figure 8.8 shows how data are distributed in RAID 4 versus RAID 5. As the
organization on the right shows, in RAID 5 the parity associated with each row of
data blocks is no longer restricted to a single disk. This organization allows
multiple writes to occur simultaneously as long as the parity blocks are not located on
the same disk. For example, a write to block 8 on the right must also access its
parity block P2, thereby occupying the first and third disks. A second write to block 5
on the right, implying an update to its parity block P1, accesses the second and
fourth disks and thus could occur concurrently with the write to block 8. Those
same writes to the organization on the left result in changes to blocks P1 and P2,
both on the fifth disk, which is a bottleneck.
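
The example above can be checked with a small layout sketch. The rotation rule used here (parity for row r on disk N-1-r, modulo N, for N = 5 disks) is an assumption chosen because it reproduces the block and parity positions named in the text; actual arrays may rotate parity differently. Disks are numbered from 0, so the "first" disk is disk 0.

```python
def raid4_locate(block: int, n_disks: int = 5) -> tuple[int, int, int]:
    """Return (row, data disk, parity disk) with parity fixed on the last disk."""
    row, index_in_row = divmod(block, n_disks - 1)
    return row, index_in_row, n_disks - 1


def raid5_locate(block: int, n_disks: int = 5) -> tuple[int, int, int]:
    """Return (row, data disk, parity disk) with parity rotated across disks.

    Assumed rotation: the parity block of row r sits on disk (n_disks - 1 - r) % n_disks.
    """
    row, index_in_row = divmod(block, n_disks - 1)
    parity_disk = (n_disks - 1 - row) % n_disks
    # Data blocks fill the remaining disks left to right, skipping the parity disk.
    data_disk = index_in_row if index_in_row < parity_disk else index_in_row + 1
    return row, data_disk, parity_disk


def small_writes_can_overlap(loc_a: tuple[int, int, int], loc_b: tuple[int, int, int]) -> bool:
    """Two small writes can proceed together only if they share no disk."""
    _, data_a, parity_a = loc_a
    _, data_b, parity_b = loc_b
    return {data_a, parity_a}.isdisjoint({data_b, parity_b})


print(raid5_locate(8))   # block 8: first disk (0) and parity P2 on third disk (2)
print(raid5_locate(5))   # block 5: second disk (1) and parity P1 on fourth disk (3)
print(small_writes_can_overlap(raid5_locate(8), raid5_locate(5)))
print(small_writes_can_overlap(raid4_locate(8), raid4_locate(5)))
```

With the fixed parity disk of RAID 4, both writes need disk 4, so they serialize; with rotated parity they use four different disks.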

P + Q Redundancy (RAID 6) 

Parity-based schemes protect against a single self-identifying failure. When a single 
failure correction is not sufficient, parity can be generalized to have a second calcu- 
lation over the data and another check disk of information. This second check 
block allows recovery from a second failure. Thus, the storage overhead is twice 
that of RAID 5. The small write shortcut of Figure 8.7 works as well, except now 
there are six disk accesses instead of four to update both P and Q information. 

RAID Summary 

RAID 1 and RAID 5 are widely used in servers; one estimate is 80% of disks in 
servers are found in some RAID system. 

FIGURE 8.8 Block-interleaved parity (RAID 4) versus distributed block-interleaved
parity (RAID 5). By distributing parity blocks to all disks, some small writes can be performed in
parallel.

One weakness of the RAID systems is repair. First, to avoid making the data
unavailable during repair, the array must be designed to allow the failed disks to
be replaced without having to turn off the system. RAIDs have enough
redundancy to allow continuous operation, but hot swapping disks places demands on
the physical and electrical design of the array and the disk interfaces. Second,
another failure could occur during repair, so the repair time affects the chances of 
losing data: the longer the repair time, the greater the chances of another failure 
that will lose data. Rather than having to wait for the operator to bring in a good 
disk, some systems include standby spares so that the data can be reconstructed 
immediately upon discovery of the failure. The operator can then replace the 
failed disks in a more leisurely fashion. Third, although disk manufacturers quote 
very high MTTF for their products, those numbers are under nominal conditions. 
If a particular disk array has been subject to temperature cycles due to, say, the 
failure of the air conditioning system, or to shaking due to a poor rack design, 
construction, or installation, the failure rates will be much higher. The calculation 
of RAID reliability assumes independence between disk failures, but disk failures 
could be correlated because such damage due to the environment would likely 
happen to all the disks in the array. Finally, a human operator ultimately deter- 
mines which disks to remove. As Figure 8.5 shows, operators are only human, so 
they occasionally remove the good disk instead of the broken disk, leading to an 
unrecoverable disk failure. 

Although RAID 6 is rarely used today, a cautious operator might want its extra 
redundancy to protect against expected hardware failures plus a safety margin to 
protect against human error and correlated failures due to problems with the 
environment. 
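
The effect of repair time on the chance of a second failure can be estimated with a rough sketch. It assumes independent, exponentially distributed disk failures, which is exactly the assumption the text cautions may not hold in practice, so the results are best read as optimistic lower bounds. The MTTF is taken from Figure 8.4; the array size and repair times are assumed values.

```python
import math


def prob_second_failure(n_disks: int, mttf_hours: float, repair_hours: float) -> float:
    """Probability that at least one surviving disk fails during the repair window.

    Assumes independent, exponentially distributed failures for each disk.
    """
    if n_disks < 2 or mttf_hours <= 0 or repair_hours < 0:
        raise ValueError("need >= 2 disks, positive MTTF, non-negative repair time")
    combined_failure_rate = (n_disks - 1) / mttf_hours   # failures per hour
    return -math.expm1(-combined_failure_rate * repair_hours)


# Five-disk group of drives with MTTF = 600,000 hours (Figure 8.4).
for repair_hours in (1, 24, 24 * 7):   # standby spare, next-day swap, one week
    p = prob_second_failure(5, 600_000, repair_hours)
    print(f"repair time {repair_hours:4d} h -> P(second failure) = {p:.2e}")
```

Under these assumptions the probability grows almost linearly with the repair time, which is the motivation for standby spares.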



hot swapping Replacing a 
hardware component while the 
system is running. 



standby spares Reserve hardware
resources that can immediately
take the place of a failed
component.
Check Yourself



Which of the following are true about dependability? 

1. If a system is up, then all its components are accomplishing their expected 
service. 

2. Availability is a quantitative measure of the percentage of time a system is 
accomplishing its expected service. 

3. Reliability is a quantitative measure of continuous service accomplishment 
by a system. 

4. The major source of outages today is software. 

Which of the following are true about RAID levels 1, 3, 4, 5, and 6? 

1. RAID systems rely on redundancy to achieve high availability. 

2. RAID 1 (mirroring) has the highest check disk overhead. 

3. For small writes, RAID 3 (bit-interleaved parity) has the worst throughput. 

4. For large writes, RAID 3, 4, and 5 have the same throughput. 



Elaboration: One issue is how mirroring interacts with striping. Suppose you had,
say, four disks' worth of data to store and eight physical disks to use. Would you create
four pairs of disks, each organized as RAID 1, and then stripe data across the four
RAID 1 pairs? Alternatively, would you create two sets of four disks, each organized as
RAID 0, and then mirror writes to both RAID 0 sets? The RAID terminology has evolved
to call the former RAID 1 + 0 or RAID 10 ("striped mirrors") and the latter RAID 0 + 1 or
RAID 01 ("mirrored stripes").



8.3 Networks



Networks are growing in popularity over time, and unlike other I/O devices, there
are many books and courses on them. For readers who have not taken courses or
read books on networking, Section 8.3 on the CD gives a quick overview of the
topics and terminology, including internetworking, the OSI model, protocol
families such as TCP/IP, long-haul networks such as ATM, local area networks such as
Ethernet, and wireless networks such as IEEE 802.11.
Buses and Other Connections between Processors, Memory, and I/O Devices



In a computer system, the various subsystems must have interfaces to one another. 
For example, the memory and processor need to communicate, as do the proces- 
sor and the I/O devices. For many years, this has been done with a bus. A bus is a 
shared communication link, which uses one set of wires to connect multiple sub- 
systems. The two major advantages of the bus organization are versatility and low 
cost. By defining a single connection scheme, new devices can easily be added, and 
peripherals can even be moved between computer systems that use the same kind 
of bus. Furthermore, buses are cost-effective because a single set of wires is shared 
in multiple ways. 

The major disadvantage of a bus is that it creates a communication bottleneck, 
possibly limiting the maximum I/O throughput. When I/O must pass through a 
single bus, the bandwidth of that bus limits the maximum I/O throughput. 
Designing a bus system capable of meeting the demands of the processor as well as 
connecting large numbers of I/O devices to the machine presents a major chal- 
lenge. 

One reason bus design is so difficult is that the maximum bus speed is largely 
limited by physical factors: the length of the bus and the number of devices. 
These physical limits prevent us from running the bus arbitrarily fast. In addition, 
the need to support a range of devices with widely varying latencies and data 
transfer rates also makes bus design challenging. 

As it becomes difficult to run many parallel wires at high speed due to clock 
skew and reflection, the industry is in transition from parallel shared buses to 
high-speed serial point-to-point interconnections with switches. Thus, such net- 
works are gradually replacing buses in our systems. 

As a result of this transition, this section has been revised in this edition to 
emphasize the general problem of connecting I/O devices, processors, and mem- 
ory rather than focus exclusively on buses. 

Bus Basics 

Classically, a bus generally contains a set of control lines and a set of data lines.
The control lines are used to signal requests and acknowledgments, and to
indicate what type of information is on the data lines. The data lines of the bus carry
information between the source and the destination. This information may
consist of data, complex commands, or addresses. For example, if a disk wants to
write some data into memory from a disk sector, the data lines will be used to
indicate the address in memory in which to place the data as well as to carry the
actual data from the disk. The control lines will be used to indicate what type of
information is contained on the data lines of the bus at each point in the transfer.
Some buses have two sets of signal lines to separately communicate both data and
address in a single bus transmission. In either case, the control lines are used to
indicate what the bus contains and to implement the bus protocol. And because
the bus is shared, we also need a protocol to decide who uses it next; we will
discuss this problem shortly.

bus transaction A sequence of bus operations that includes a request and may
include a response, either of which may carry data. A transaction is initiated by
a single request and may take many individual bus operations.

processor-memory bus A bus that connects processor and memory and that is
short, generally high speed, and matched to the memory system so as to
maximize memory-processor bandwidth.

backplane bus A bus that is designed to allow processors, memory, and I/O
devices to coexist on a single bus.

synchronous bus A bus that includes a clock in the control lines and a fixed
protocol for communicating that is relative to the clock.

asynchronous bus A bus that uses a handshaking protocol for coordinating usage
rather than a clock; can accommodate a wide variety of devices of differing
speeds.

Let's consider a typical bus transaction. A bus transaction includes two 
parts: sending the address and receiving or sending the data. Bus transactions are 
typically defined by what they do to memory. A read transaction transfers data 
from memory (to either the processor or an I/O device), and a write transaction 
writes data to the memory. Clearly, this terminology is confusing. To avoid this, 
we'll try to use the terms input and output, which are always defined from the per- 
spective of the processor: an input operation is inputting data from the device to 
memory, where the processor can read it, and an output operation is outputting 
data to a device from memory where the processor wrote it. 

Buses are traditionally classified as processor-memory buses or I/O buses. Pro- 
cessor-memory buses are short, generally high speed, and matched to the memory 
system so as to maximize memory-processor bandwidth. I/O buses, by contrast, 
can be lengthy, can have many types of devices connected to them, and often have 
a wide range in the data bandwidth of the devices connected to them. I/O buses 
do not typically interface directly to the memory but use either a processor-mem- 
ory or a backplane bus to connect to memory. Other buses with different charac- 
teristics have emerged for special functions, such as graphics buses. 

The I/O bus serves as a way of expanding the machine and connecting new 
peripherals. To make this easier, the computer industry has developed several 
standards. The standards serve as a specification for the computer manufacturer 
and for the peripheral manufacturer. A standard ensures the computer designer 
that peripherals will be available for a new machine, and it ensures the peripheral 
builder that users will be able to hook up their new equipment. Figure 8.9 sum- 
marizes the key characteristics of the two dominant I/O bus standards: Firewire 
and USB. They connect a variety of devices to the desktop computer, from key- 
boards to cameras to disks. 

The two basic schemes for communication on the bus are synchronous and 
asynchronous. If a bus is synchronous, it includes a clock in the control lines and 
a fixed protocol for communicating that is relative to the clock. For example, for a 
processor-memory bus performing a read from memory, we might have a proto- 
col that transmits the address and read command on the first clock cycle, using 
the control lines to indicate the type of request. The memory might then be 
required to respond with the data word on the fifth clock. This type of protocol 
can be implemented easily in a small finite state machine. Because the protocol is 
predetermined and involves little logic, the bus can run very fast and the interface 
logic will be small. 

| Characteristic | Firewire (1394) | USB 2.0 |
|---|---|---|
| Bus type | I/O | I/O |
| Basic data bus width (signals) | 4 | 2 |
| Clocking | asynchronous | asynchronous |
| Theoretical peak bandwidth | 50 MB/sec (Firewire 400) or 100 MB/sec (Firewire 800) | 0.2 MB/sec (low speed), 1.5 MB/sec (full speed), or 60 MB/sec (high speed) |
| Hot pluggable | yes | yes |
| Maximum number of devices | 63 | 127 |
| Maximum bus length (copper wire) | 4.5 meters | 5 meters |
| Standard name | IEEE 1394, 1394b | USB Implementors Forum |



FIGURE 8.9 Key characteristics of two dominant I/O bus standards. 



Synchronous buses have two major disadvantages, however. First, every device 
on the bus must run at the same clock rate. Second, because of clock skew
problems, synchronous buses cannot be long if they are fast (see Appendix B for a
discussion of clock skew). Processor-memory buses are often synchronous
because the devices communicating are close, small in number, and prepared to 
operate at high clock rates. 

An asynchronous bus is not clocked. Because it is not clocked, an asynchronous 
bus can accommodate a wide variety of devices, and the bus can be lengthened 
without worrying about clock skew or synchronization problems. Both Firewire 
and USB 2.0 are asynchronous. To coordinate the transmission of data between
sender and receiver, an asynchronous bus uses a handshaking protocol. A hand- 
shaking protocol consists of a series of steps in which the sender and receiver pro- 
ceed to the next step only when both parties agree. The protocol is implemented 
with an additional set of control lines. 

A simple example will illustrate how asynchronous buses work. Let's consider a 
device requesting a word of data from the memory system. Assume that there are 
three control lines: 

1. ReadReq: Used to indicate a read request for memory. The address is put 
on the data lines at the same time. 

2. DataRdy: Used to indicate that the data word is now ready on the data 
lines. In an output transaction, the memory will assert this signal since it is 
providing the data. In an input transaction, an I/O device would assert this 
signal, since it would provide data. In either case, the data is placed on the 
data lines at the same time. 

3. Ack: Used to acknowledge the ReadReq or the DataRdy signal of the other 
party. 



handshaking protocol A 
series of steps used to coordi- 
nate asynchronous bus transfers 
in which the sender and receiver 
proceed to the next step only 
when both parties agree that 
the current step has been 
completed. 
In an asynchronous protocol, the control signals ReadReq and DataRdy are
asserted until the other party (the memory or the device) indicates that the con- 
trol lines have been seen and the data lines have been read; this indication is made 
by asserting the Ack line. This complete process is called handshaking. Figure 8.10 
shows how such a protocol operates by depicting the steps in the communication. 

Although much of the bandwidth of a bus is decided by the choice of a syn- 
chronous or asynchronous protocol and the timing characteristics of the bus, sev- 
eral other factors affect the bandwidth that can be attained by a single transfer. 
The most important of these are the data bus width, and whether it supports 
block transfers or it transfers a word at a time. 



[Figure 8.10 timing diagram: signal traces for ReadReq, Data, Ack, and DataRdy, with arrows marking steps 1 through 7.]

The steps in the protocol begin immediately after the device signals a request by raising ReadReq and 
putting the address on the Data lines: 

1. When memory sees the ReadReq line, it reads the address from the data bus and raises Ack to 
indicate it has been seen. 

2. I/O device sees the Ack line high and releases the ReadReq and data lines. 

3. Memory sees that ReadReq is low and drops the Ack line to acknowledge the ReadReq signal. 

4. This step starts when the memory has the data ready. It places the data from the read request on 
the data lines and raises DataRdy. 

5. The I/O device sees DataRdy, reads the data from the bus, and signals that it has the data by raising 
Ack. 

6. The memory sees the Ack signal, drops DataRdy, and releases the data lines. 

7. Finally, the I/O device, seeing DataRdy go low, drops the Ack line, which indicates that the 
transmission is completed. 

A new bus transaction can now begin. 

FIGURE 8.10 The asynchronous handshaking protocol consists of seven steps to read a 
word from memory and receive it in an I/O device. The signals in color are those asserted by the 
I/O device, while the memory asserts the signals shown in black. The arrows label the seven steps and the 
event that triggers each step. The symbol showing two lines (high and low) at the same time on the data 
lines indicates that the data lines have valid data at this point. (The symbol indicates that the data is valid, 
but the value is not known.) 
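
The seven steps can also be traced in software. The sketch below is a sequential walk-through, not a concurrent hardware model: a dictionary stands in for the shared bus lines, and each step changes only the signals the text says that party drives. The function name and trace format are assumptions made for illustration.

```python
def handshake_read(memory: dict[int, int], address: int):
    """Walk through the handshake of Figure 8.10; return (data word, trace)."""
    bus = {"ReadReq": 0, "DataRdy": 0, "Ack": 0, "Data": None}
    trace = []

    def log(step: int) -> None:
        trace.append((step, dict(bus)))   # snapshot of the lines after each step

    # Before step 1: the device raises ReadReq and puts the address on the data lines.
    bus["ReadReq"], bus["Data"] = 1, address
    log(0)
    # 1. Memory sees ReadReq, reads the address, and raises Ack.
    latched_address = bus["Data"]
    bus["Ack"] = 1
    log(1)
    # 2. The device sees Ack high and releases ReadReq and the data lines.
    bus["ReadReq"], bus["Data"] = 0, None
    log(2)
    # 3. Memory sees ReadReq low and drops Ack.
    bus["Ack"] = 0
    log(3)
    # 4. Memory has the data ready: it drives the data lines and raises DataRdy.
    bus["Data"], bus["DataRdy"] = memory[latched_address], 1
    log(4)
    # 5. The device sees DataRdy, reads the data, and raises Ack.
    received = bus["Data"]
    bus["Ack"] = 1
    log(5)
    # 6. Memory sees Ack, drops DataRdy, and releases the data lines.
    bus["DataRdy"], bus["Data"] = 0, None
    log(6)
    # 7. The device sees DataRdy low and drops Ack; the transaction is complete.
    bus["Ack"] = 0
    log(7)
    return received, trace


word, steps = handshake_read({0x10: 99}, address=0x10)
for step, lines in steps:
    print(step, lines)
```

Every line ends the transaction deasserted, so a new bus transaction can begin from the same idle state.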
Elaboration: Another method for increasing the effective bus bandwidth is to
release the bus when it is not being used for transmitting information. This type of pro- 
tocol is called a split transaction protocol. The advantage of such a protocol is that, by 
freeing the bus during the time data is not being transmitted, the protocol allows 
another requestor to use the bus. This can improve the effective bus bandwidth for the 
entire system if the memory is sophisticated enough to handle multiple overlapping 
transactions. Multiprocessors sharing a memory bus may use split transaction proto- 
cols. 



split transaction protocol A 
protocol in which the bus is 
released during a bus transac- 
tion while the requester is wait- 
ing for the data to be 
transmitted, which frees the bus 
for access by another requester. 



The Buses and Networks of the Pentium 4 

Figure 8.11 shows the I/O system of a PC based on the Pentium 4. The processor
connects to peripherals via two main chips. The chip next to the processor is the 
memory controller hub, commonly called the north bridge, and the one connected 
to it is the I/O controller hub, called the south bridge. 



[Figure 8.11 diagram: the Pentium 4 processor connects over the system bus (800 MHz, 6.4 GB/sec) to the
memory controller hub (north bridge, 82875P). The north bridge connects to the main memory DIMMs over two
DDR 400 channels (3.2 GB/sec each), to graphics over AGP 8X (2.1 GB/sec), to 1 Gbit Ethernet over CSA
(0.266 GB/sec), and to the I/O controller hub (south bridge, 82801EB) over a 266 MB/sec link. The south
bridge connects two Serial ATA disks (150 MB/sec each), two Parallel ATA links (100 MB/sec each, one
shown with a CD/DVD drive), AC/97 stereo surround sound (1 MB/sec), USB 2.0 (60 MB/sec), 10/100 Mbit
Ethernet (20 MB/sec), and the PCI bus (132 MB/sec).]



FIGURE 8.11 Organization of the I/O system on a Pentium 4 PC using the Intel 875 chip 
set. Note that the maximum transfer rate between the north bridge (memory hub) and south bridge (I/O 
hub) is 266 MB/sec, which is why Intel put the AGP bus and Gigabit Ethernet on the north bridge.

The north bridge is basically a DMA controller, connecting the processor to
memory, the AGP graphic bus, and the south bridge chip. The south bridge con- 
nects the north bridge to a cornucopia of I/O buses. Intel and others offer a wide 
variety of these chip sets to connect the Pentium 4 to the outside world. To give a 
flavor of the options, Figure 8.12 shows two of the chip sets. 

As Moore's law continues, an increasing number of I/O controllers that were 
formerly available as optional cards that connected to I/O buses have been co- 
opted into these chip sets. For example, the south bridge chip of the Intel 875 
includes a striping RAID controller, and the north bridge chip of the Intel 845GL 
includes a graphics controller. 





                                         875P chip set                845GL chip set
Target segment                           Performance PC               Value PC
System bus (64 bit)                      800/533 MHz                  400 MHz

Memory controller hub ("north bridge")
Package size, pins                       42.5 x 42.5 mm, 1005         37.5 x 37.5 mm, 760
Memory speed                             DDR 400/333/266 SDRAM        DDR 266/200, PC133 SDRAM
Memory buses, widths                     2 x 72                       1 x 64
Number of DIMMs, DRAM Mbit support       4, 128/256/512 Mbits         2, 128/256/512 Mbits
Maximum memory capacity                  4 GB                         2 GB
Memory error correction available?       yes                          no
AGP graphics bus, speed                  yes, 8X or 4X                no
Graphics controller                      external                     internal (Extreme Graphics)
CSA Gigabit Ethernet interface           yes                          no
South bridge interface speed (8 bit)     266 MHz                      266 MHz

I/O controller hub ("south bridge")
Package size, pins                       31 x 31 mm, 460              31 x 31 mm, 421
PCI bus: width, speed, masters           32-bit, 33 MHz, 6 masters    32-bit, 33 MHz, 6 masters
Ethernet MAC controller, interface       100/10 Mbit                  100/10 Mbit
USB 2.0 ports, controllers               8, 4                         6, 3
ATA 100 ports                            2                            2
Serial ATA 150 controller, ports         yes, 2                       no
RAID controller                          yes                          no
AC-97 audio controller, interface        yes                          yes
I/O management                           SMbus 2.0, GPIO              SMbus 2.0, GPIO



FIGURE 8.12 Two Pentium 4 I/O chip sets from Intel. The 845GL north bridge uses many fewer 
pins than the 875 by having just one memory bus and by omitting the AGP bus and the Gigabit Ethernet 
interface. Note that the serial nature of USB and Serial ATA means that two more USB ports and two more 
Serial ATA ports need just 39 more pins in the south bridge of the 875 versus the 845GL chip sets.

These two chips demonstrate the gradual evolution from parallel shared buses
to high-speed serial point-to-point interconnections with switches via the past 
and future versions of ATA and PCI. 

Serial ATA is a serial successor to the parallel ATA bus used by magnetic and
optical disks in PCs. The first generation transfers at 150 MB/sec, compared to the
100 MB/sec of the parallel ATA-100 bus. Its maximum cable length is 1 meter, twice
that of ATA-100. It uses just 7 wires, with one 2-wire data channel in each
direction, compared to 80 for ATA-100.

The south bridge in Figure 8.11 demonstrates the transitory period between 
parallel buses and serial networks by providing both parallel and serial ATA buses. 

PCI Express is a serial successor to the popular PCI bus. Rather than 32-64 
shared wires operating at 33 MHz-133 MHz with a peak bandwidth of 132-1064 
MB/sec, PCI Express uses just 4 wires in each direction operating at 625 MHz to 
offer 300 MB/sec per direction. The bandwidth per pin of PCI Express is 5 to 10
times its predecessors. A computer can then afford to have several PCI Express 
interfaces to get even higher bandwidth. 

Although the chips in Figure 8.11 only show the parallel PCI bus, Intel plans to
replace the AGP graphics bus and the bus between the north bridge and the south 
bridge with PCI Express in the next generation of these chips. 

Buses and networks provide electrical interconnection among I/O devices, pro- 
cessors, and memory, and also define the lowest-level protocol for communica- 
tion. Above this basic level, we must define hardware and software protocols for 
controlling data transfers between I/O devices and memory, and for the processor 
to specify commands to the I/O devices. These topics are covered in the next sec- 
tion. 



Check Yourself

Both networks and buses connect components together. Which of the following
are true about them?

1. Networks and I/O buses are almost always standardized.

2. Shared media networks and multimaster buses need an arbitration scheme.

3. Local area networks and processor-memory buses are almost always
synchronous.

4. High-performance networks and buses use similar techniques compared to
their lower-performance alternatives: they are wider, send many words per
transaction, and have separate address and data lines.



8.5 Interfacing I/O Devices to the Processor, Memory, and Operating System




A bus or network protocol defines how a word or block of data should be
communicated on a set of wires. This still leaves several other tasks that must be
performed to actually cause data to be transferred from a device and into the memory
address space of some user program. This section focuses on these tasks and will
answer such questions as the following:

■ How is a user I/O request transformed into a device command and
communicated to the device?

■ How is data actually transferred to or from a memory location?

■ What is the role of the operating system?

As we will see in answering these questions, the operating system plays a major
role in handling I/O, acting as the interface between the hardware and the
program that requests I/O.

The responsibilities of the operating system arise from three characteristics of
I/O systems:

1. Multiple programs using the processor share the I/O system.

2. I/O systems often use interrupts (externally generated exceptions) to
communicate information about I/O operations. Because interrupts cause a
transfer to kernel or supervisor mode, they must be handled by the
operating system (OS).

3. The low-level control of an I/O device is complex because it requires
managing a set of concurrent events and because the requirements for correct
device control are often very detailed.


Hardware Software Interface

The three characteristics of I/O systems above lead to several different functions
the OS must provide:

■ The OS guarantees that a user's program accesses only the portions of an
I/O device to which the user has rights. For example, the OS must not allow
a program to read or write a file on disk if the owner of the file has not
granted access to this program. In a system with shared I/O devices,
protection could not be provided if user programs could perform I/O directly.

■ The OS provides abstractions for accessing devices by supplying routines
that handle low-level device operations.

■ The OS handles the interrupts generated by I/O devices, just as it handles
the exceptions generated by a program.

■ The OS tries to provide equitable access to the shared I/O resources, as well
as schedule accesses in order to enhance system throughput.

To perform these functions on behalf of user programs, the operating system 
must be able to communicate with the I/O devices and to prevent the user pro- 
gram from communicating with the I/O devices directly. Three types of commu- 
nication are required: 

1. The OS must be able to give commands to the I/O devices. These com- 
mands include not only operations like read and write, but also other oper- 
ations to be done on the device, such as a disk seek. 

2. The device must be able to notify the OS when the I/O device has com- 
pleted an operation or has encountered an error. For example, when a disk 
completes a seek, it will notify the OS. 

3. Data must be transferred between memory and an I/O device. For example, 
the block being read on a disk read must be moved from disk to memory. 

In the next few sections, we will see how these communications are performed. 



Giving Commands to I/O Devices 

To give a command to an I/O device, the processor must be able to address the 
device and to supply one or more command words. Two methods are used to 
address the device: memory-mapped I/O and special I/O instructions. In 
memory-mapped I/O, portions of the address space are assigned to I/O devices. 
Reads and writes to those addresses are interpreted as commands to the I/O 
device. 

For example, a write operation can be used to send data to an I/O device where 
the data will be interpreted as a command. When the processor places the address 
and data on the memory bus, the memory system ignores the operation because 
the address indicates a portion of the memory space used for I/O. The device con- 
troller, however, sees the operation, records the data, and transmits it to the device 
as a command. User programs are prevented from issuing I/O operations directly 
because the OS does not provide access to the address space assigned to the I/O 
devices and thus the addresses are protected by the address translation. Memory- 
mapped I/O can also be used to transmit data by writing or reading to select 
addresses. The device uses the address to determine the type of command, and the 
data may be provided by a write or obtained by a read. In any event, the address 
encodes both the device identity and the type of transmission between processor 
and device. 
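
The address decoding described above can be sketched in a few lines of Python. This is a toy model rather than a description of any real machine: the address range `IO_BASE`..`IO_BASE + IO_SIZE`, the `Bus` and `DeviceController` classes, and the register offset are all assumptions made for illustration.

```python
# Toy model of memory-mapped I/O: one address range belongs to a device.
# IO_BASE, IO_SIZE, and the register layout are hypothetical.
IO_BASE = 0xFFFF0000
IO_SIZE = 0x10


class DeviceController:
    """Records writes to its registers as commands for the device."""

    def __init__(self):
        self.commands = []  # (register offset, data) pairs seen on the bus

    def write(self, offset, data):
        self.commands.append((offset, data))


class Bus:
    def __init__(self, device):
        self.memory = {}      # sparse model of main memory
        self.device = device

    def write(self, address, data):
        if IO_BASE <= address < IO_BASE + IO_SIZE:
            # Memory ignores this address; the controller claims it.
            self.device.write(address - IO_BASE, data)
        else:
            self.memory[address] = data


device = DeviceController()
bus = Bus(device)
bus.write(0x1000, 42)           # ordinary store to memory
bus.write(IO_BASE + 4, 0x01)    # store interpreted as a device command
```

After these two writes, only the first has changed memory; the second appears in `device.commands` with its offset, which is how the address can encode both the device identity and the type of transmission.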



memory-mapped I/O An I/O scheme in which portions of address space are
assigned to I/O devices and reads and writes to those addresses are
interpreted as commands to the I/O device.

Actually performing a read or write of data to fulfill a program request usually
requires several separate I/O operations. Furthermore, the processor may have to
interrogate the status of the device between individual commands to determine
whether the command completed successfully. For example, a simple printer has
two I/O device registers: one for status information and one for data to be
printed. The Status register contains a done bit, set by the printer when it has
printed a character, and an error bit, indicating that the printer is jammed or out
of paper. Each byte of data to be printed is put into the Data register. The
processor must then wait until the printer sets the done bit before it can place another
character in the buffer. The processor must also check the error bit to determine if
a problem has occurred. Each of these operations requires a separate I/O device
access.
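
The printer example can be made concrete with a small simulation. The sketch below assumes a hypothetical `Printer` class whose bit positions (`DONE_BIT`, `ERROR_BIT`) and fixed delay (`busy_polls`) are invented for illustration; it counts how many Status register reads the processor needs to print three bytes.

```python
# Hypothetical printer with a Status register (done and error bits)
# and a Data register. Bit positions are assumptions.
DONE_BIT = 0x1
ERROR_BIT = 0x2


class Printer:
    def __init__(self, busy_polls=3):
        self.status = DONE_BIT       # idle and ready for the first byte
        self.printed = []
        self._busy_polls = busy_polls
        self._remaining = 0

    def write_data(self, byte):
        self.printed.append(byte)
        self.status &= ~DONE_BIT     # printer is busy with this byte
        self._remaining = self._busy_polls

    def read_status(self):
        # Each status read models one I/O access; the device finishes
        # after a fixed number of polls.
        if self._remaining > 0:
            self._remaining -= 1
            if self._remaining == 0:
                self.status |= DONE_BIT
        return self.status


def print_bytes(printer, data):
    """Send bytes one at a time, checking the done and error bits first."""
    status_reads = 0
    for byte in data:
        while True:
            status = printer.read_status()
            status_reads += 1
            if status & ERROR_BIT:
                raise IOError("printer reported an error")
            if status & DONE_BIT:
                break
        printer.write_data(byte)
    return status_reads


printer = Printer(busy_polls=3)
reads = print_bytes(printer, b"Hi!")
```

Even in this small model the processor performs more status reads than data writes, which illustrates why each byte costs several separate device accesses.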



I/O instructions A dedicated 
instruction that is used to give a 
command to an I/O device and 
that specifies both the device 
number and the command 
word (or the location of the 
command word in memory). 



Elaboration: The alternative to memory-mapped I/O is to use dedicated I/O instruc- 
tions in the processor. These I/O instructions can specify both the device number and 
the command word (or the location of the command word in memory). The processor 
communicates the device address via a set of wires normally included as part of the 
I/O bus. The actual command can be transmitted over the data lines in the bus. Exam- 
ples of computers with I/O instructions are the Intel IA-32 and the IBM 370 computers. 
By making the I/O instructions illegal to execute when not in kernel or supervisor 
mode, user programs can be prevented from accessing the devices directly. 



polling The process of periodi- 
cally checking the status of an 
I/O device to determine the 
need to service the device. 



interrupt-driven I/O An I/O scheme that employs interrupts
to indicate to the processor that
an I/O device needs attention.



Communicating with the Processor 

The process of periodically checking status bits to see if it is time for the next I/O 
operation, as in the previous example, is called polling. Polling is the simplest way 
for an I/O device to communicate with the processor. The I/O device simply puts 
the information in a Status register, and the processor must come and get the 
information. The processor is totally in control and does all the work. 

Polling can be used in several different ways. Real-time embedded applications
often poll their I/O devices because the I/O rates are predetermined and polling
makes the I/O overhead more predictable, which helps in meeting real-time
deadlines. As we will see, this allows polling to be used even when the I/O rate
is somewhat higher.

The disadvantage of polling is that it can waste a lot of processor time because 
processors are so much faster than I/O devices. The processor may read the Status 
register many times, only to find that the device has not yet completed a compara- 
tively slow I/O operation, or that the mouse has not budged since the last time it 
was polled. When the device completes an operation, we must still read the status 
to determine whether it was successful. 
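
To get a feel for how much processor time polling can waste, the overhead can be estimated as (polls per second × cycles per poll) / clock rate. The numbers in the sketch below (a 500 MHz clock, 400 cycles per poll, and the two polling rates) are assumptions chosen only to contrast a slow device with a fast one.

```python
def polling_overhead(polls_per_sec, cycles_per_poll, clock_hz):
    """Fraction of processor cycles spent polling one device."""
    return polls_per_sec * cycles_per_poll / clock_hz


# Assumed numbers: 500 MHz clock, 400 cycles per poll.
CLOCK_HZ = 500_000_000
CYCLES_PER_POLL = 400

mouse = polling_overhead(30, CYCLES_PER_POLL, CLOCK_HZ)        # slow device
disk = polling_overhead(250_000, CYCLES_PER_POLL, CLOCK_HZ)    # fast device

print(f"mouse: {mouse:.4%} of the processor")
print(f"disk:  {disk:.0%} of the processor")
```

Under these assumptions the slow device costs a negligible fraction of the processor, while the fast one consumes a fifth of it whether or not any data is ready.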

The overhead in a polling interface was recognized long ago, leading to the
invention of interrupts to notify the processor when an I/O device requires
attention from the processor. Interrupt-driven I/O, which is used by almost all systems
for at least some devices, employs I/O interrupts to indicate to the processor that
an I/O device needs attention. When a device wants to notify the processor that it
has completed some operation or needs attention, it causes the processor to be
interrupted.

An I/O interrupt is just like the exceptions we saw in Chapters 5, 6, and 7, with 
two important exceptions: 

1. An I/O interrupt is asynchronous with respect to the instruction execution. 
That is, the interrupt is not associated with any instruction and does not 
prevent the instruction completion. This is very different from either page 
fault exceptions or exceptions such as arithmetic overflow. Our control unit 
need only check for a pending I/O interrupt at the time it starts a new 
instruction. 

2. In addition to the fact that an I/O interrupt has occurred, we would like to 
convey further information such as the identity of the device generating the 
interrupt. Furthermore, the interrupts represent devices that may have dif- 
ferent priorities and whose interrupt requests have different urgencies asso- 
ciated with them. 

To communicate information to the processor, such as the identity of the 
device raising the interrupt, a system can use either vectored interrupts or an 
exception Cause register. When the processor recognizes the interrupt, the device 
can send either the vector address or a status field to place in the Cause register. As 
a result, when the OS gets control, it knows the identity of the device that caused 
the interrupt and can immediately interrogate the device. An interrupt mecha- 
nism eliminates the need for the processor to poll the device and instead allows 
the processor to focus on executing programs. 

Interrupt Priority Levels 

To deal with the different priorities of the I/O devices, most interrupt mechanisms 
have several levels of priority: UNIX operating systems use four to six levels. These 
priorities indicate the order in which the processor should process interrupts. 
Both internally generated exceptions and external I/O interrupts have priorities; 
typically, I/O interrupts have lower priority than internal exceptions. There may 
be multiple I/O interrupt priorities, with high-speed devices associated with the 
higher priorities. 

To support priority levels for interrupts, MIPS provides the primitives that let
the operating system implement the policy, similar to how MIPS handles TLB
misses. Figure 8.13 shows the key registers, and Section A.7 in Appendix A
gives more details.

The Status register determines who can interrupt the computer. If the interrupt
enable bit is 0, then none can interrupt. A more refined blocking of interrupts is
available in the interrupt mask field. There is a bit in the mask corresponding to
each bit in the pending interrupt field of the Cause register. To enable the
corresponding interrupt, there must be a 1 in the mask field at that bit position. Once
an interrupt occurs, the operating system can find the reason in the exception
code field of the Cause register: 0 means an interrupt occurred, with other values
for the exceptions mentioned in Chapter 7.

FIGURE 8.13 The Cause and Status registers. This version of the Cause register corresponds to
the MIPS-32 architecture. The earlier MIPS I architecture had three nested sets of kernel/user and interrupt
enable bits to support nested interrupts. Section A.7 in Appendix A has more details about these
registers.

Here are the steps that must occur in handling an interrupt: 

1. Logically AND the pending interrupt field and the interrupt mask field to
see which enabled interrupts could be the culprit. Copies are made of these
two registers using the mfc0 instruction.

2. Select the highest priority of these interrupts. The software convention is
that the leftmost is the highest priority.

3. Save the interrupt mask field of the Status register.

4. Change the interrupt mask field to disable all interrupts of equal or lower
priority.

5. Save the processor state needed to handle the interrupt.

6. To allow higher-priority interrupts, set the interrupt enable bit of the Status
register to 1.

7. Call the appropriate interrupt routine.

8. Before restoring state, set the interrupt enable bit of the Status register to 0.
This allows you to restore the interrupt mask field.
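
The bit manipulation in steps 1, 2, and 4 can be sketched as follows. This is an illustrative model, not MIPS code: it assumes 8-bit pending and mask fields held in ordinary integers, and the function names are hypothetical.

```python
FIELD_BITS = 8  # width of the pending interrupt and interrupt mask fields


def highest_enabled_interrupt(pending, mask):
    """Steps 1 and 2: AND the fields, then pick the leftmost set bit.

    Returns the bit position (FIELD_BITS - 1 is leftmost), or None.
    """
    candidates = pending & mask
    if candidates == 0:
        return None
    return candidates.bit_length() - 1


def mask_while_handling(mask, level):
    """Step 4: disable interrupts of equal or lower priority than level."""
    higher_only = ((1 << FIELD_BITS) - 1) & ~((1 << (level + 1)) - 1)
    return mask & higher_only


pending = 0b01010100
mask = 0b11110011           # bits 2 and 3 are currently disabled
saved_mask = mask           # step 3: save the mask before changing it
level = highest_enabled_interrupt(pending, mask)
mask = mask_while_handling(mask, level)
# ... steps 5 to 7: save state, enable interrupts, run the handler ...
mask = saved_mask           # step 8: restore the saved mask
```

In this example the pending interrupt at bit 2 is ignored because its mask bit is 0, so the handler is chosen for bit 6, and while it runs only bit 7 remains enabled.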



Appendix A shows an exception handler for a simple I/O task on pages A-36 to
A-37.

How do the interrupt priority levels (IPL) correspond to these mechanisms? The 
IPL is an operating system invention. It is stored in the memory of the process,
and every process is given an IPL. At the lowest IPL, all interrupts are permitted. 
Conversely, at the highest IPL, all interrupts are blocked. Raising and lowering the 
IPL involves changes to the interrupt mask field of the Status register. 



Elaboration: The two least significant bits of the pending interrupt and interrupt 
mask fields are for software interrupts, which are lower priority. These are typically 
used by higher-priority interrupts to leave work for lower-priority interrupts to do once 
the immediate reason for the interrupt is handled. Once the higher-priority interrupt is 
finished, the lower-priority tasks will be noticed and handled. 



Transferring the Data between a Device and Memory 

We have seen two different methods that enable a device to communicate with the 
processor. These two techniques — polling and I/O interrupts — form the basis for 
two methods of implementing the transfer of data between the I/O device and 
memory. Both these techniques work best with lower-bandwidth devices, where 
we are more interested in reducing the cost of the device controller and interface 
than in providing a high-bandwidth transfer. Both polling and interrupt-driven 
transfers put the burden of moving data and managing the transfer on the proces- 
sor. After looking at these two schemes, we will examine a scheme more suitable 
for higher-performance devices or collections of devices. 

We can use the processor to transfer data between a device and memory based 
on polling. In real-time applications, the processor loads data from I/O device 
registers and stores them into memory. 

An alternative mechanism is to make the transfer of data interrupt driven. In 
this case, the OS would still transfer data in small numbers of bytes from or to the 
device. But because the I/O operation is interrupt driven, the OS simply works on 
other tasks while data is being read from or written to the device. When the OS 
recognizes an interrupt from the device, it reads the status to check for errors. If 
there are none, the OS can supply the next piece of data, for example, by a 
sequence of memory-mapped writes. When the last byte of an I/O request has 
been transmitted and the I/O operation is completed, the OS can inform the pro- 
gram. The processor and OS do all the work in this process, accessing the device 
and memory for each data item transferred. 

Interrupt-driven I/O relieves the processor from having to wait for every I/O
event, although if we used this method for transferring data from or to a hard
disk, the overhead could still be intolerable, since it could consume a large
fraction of the processor when the disk was transferring. For high-bandwidth devices
like hard disks, the transfers consist primarily of relatively large blocks of data
(hundreds to thousands of bytes). Thus, computer designers invented a
mechanism for offloading the processor and having the device controller transfer data
directly to or from the memory without involving the processor. This mechanism
is called direct memory access (DMA). The interrupt mechanism is still used by
the device to communicate with the processor, but only on completion of the I/O
transfer or when an error occurs.

direct memory access (DMA) A mechanism that provides a
device controller the ability to transfer data directly to or from
the memory without involving the processor.

bus master A unit on the bus that can initiate bus requests.

DMA is implemented with a specialized controller that transfers data between 
an I/O device and memory independent of the processor. The DMA controller 
becomes the bus master and directs the reads or writes between itself and mem- 
ory. There are three steps in a DMA transfer: 

1. The processor sets up the DMA by supplying the identity of the device, the
operation to perform on the device, the memory address that is the source 
or destination of the data to be transferred, and the number of bytes to 
transfer. 

2. The DMA starts the operation on the device and arbitrates for the bus. 
When the data is available (from the device or memory), it transfers the 
data. The DMA device supplies the memory address for the read or the 
write. If the request requires more than one transfer on the bus, the DMA 
unit generates the next memory address and initiates the next transfer. 
Using this mechanism, a DMA unit can complete an entire transfer, which 
may be thousands of bytes in length, without bothering the processor. 
Many DMA controllers contain some memory to allow them to deal flexi- 
bly with delays either in transfer or those incurred while waiting to become 
bus master. 

3. Once the DMA transfer is complete, the controller interrupts the processor, 
which can then determine by interrogating the DMA device or examining 
memory whether the entire operation completed successfully. 
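
The three steps can be illustrated with a toy simulation. The `DMAController` class below, its 4-byte bus width, and the callback standing in for the completion interrupt are all assumptions made for this sketch; bus arbitration and device timing are not modeled.

```python
class DMAController:
    """Toy DMA engine that copies device data into memory in bus-sized chunks."""

    def __init__(self, memory, bus_width=4):
        self.memory = memory          # bytearray standing in for main memory
        self.bus_width = bus_width    # bytes moved per bus transfer (assumed)
        self.bus_transfers = 0
        self.done = False

    def setup(self, device_data, address, count):
        # Step 1: the processor supplies source, memory address, and byte count.
        self.device_data = device_data
        self.address = address
        self.count = count
        self.done = False

    def run(self, on_complete):
        # Step 2: move the data, generating each memory address itself.
        offset = 0
        while offset < self.count:
            n = min(self.bus_width, self.count - offset)
            start = self.address + offset
            self.memory[start:start + n] = self.device_data[offset:offset + n]
            self.bus_transfers += 1
            offset += n
        # Step 3: interrupt the processor once, at the end.
        self.done = True
        on_complete(self)


interrupts = []
memory = bytearray(64)
dma = DMAController(memory)
dma.setup(device_data=bytes(range(10)), address=16, count=10)
dma.run(on_complete=interrupts.append)
```

The 10-byte request takes three bus transfers but produces a single interrupt, which is the point of the mechanism: the processor is involved only at setup and at completion.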

There may be multiple DMA devices in a computer system. For example, in a 
system with a single processor-memory bus and multiple I/O buses, each I/O bus 
controller will often contain a DMA processor that handles any transfers between 
a device on the I/O bus and the memory. 

Unlike either polling or interrupt-driven I/O, DMA can be used to interface a 
hard disk without consuming all the processor cycles for a single I/O. Of course, if 
the processor is also contending for memory, it will be delayed when the memory 
is busy doing a DMA transfer. By using caches, the processor can avoid having to 
access memory most of the time, thereby leaving most of the memory bandwidth 
free for use by I/O devices. 



Elaboration: To further reduce the need to interrupt the processor and occupy it in
handling an I/O request that may involve doing several actual operations, the I/O
controller can be made more intelligent. Intelligent controllers are often called I/O
processors (as well as I/O controllers or channel controllers). These specialized processors
basically execute a series of I/O operations, called an I/O program. The program may
be stored in the I/O processor, or it may be stored in memory and fetched by the I/O
processor. When using an I/O processor, the operating system typically sets up an
I/O program that indicates the I/O operations to be done as well as the size and
transfer address for any reads or writes. The I/O processor then takes the operations
from the I/O program and interrupts the processor only when the entire program is
completed. DMA processors are essentially special-purpose processors (usually
single-chip and nonprogrammable), while I/O processors are often implemented with
general-purpose microprocessors, which run a specialized I/O program.



Direct Memory Access and the Memory System 

When DMA is incorporated into an I/O system, the relationship between the 
memory system and processor changes. Without DMA, all accesses to the memory 
system come from the processor and thus proceed through address translation 
and cache access as if the processor generated the references. With DMA, there is 
another path to the memory system — one that does not go through the address 
translation mechanism or the cache hierarchy. This difference generates some 
problems in both virtual memory systems and systems with caches. These prob- 
lems are usually solved with a combination of hardware techniques and software 
support. 

The difficulties in having DMA in a virtual memory system arise because pages 
have both a physical and a virtual address. DMA also creates problems for systems 
with caches because there can be two copies of a data item: one in the cache and 
one in memory. Because the DMA processor issues memory requests directly to 
the memory rather than through the processor cache, the value of a memory loca- 
tion seen by the DMA unit and the processor may differ. Consider a read from 
disk that the DMA unit places directly into memory. If some of the locations into 
which the DMA writes are in the cache, the processor will receive the old value 
when it does a read. Similarly, if the cache is write-back, the DMA may read a 
value directly from memory when a newer value is in the cache, and the value has 
not been written back. This is called the stale data problem or coherence problem. 
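
Both directions of the stale data problem can be reproduced with a very small model. The `WriteBackCache` class below is a hypothetical sketch (one value per address, no replacement policy), and the `invalidate` and `flush` operations are included only to show what would make the two views agree again.

```python
class WriteBackCache:
    """Tiny write-back cache model: address -> [value, dirty flag]."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = [self.memory[addr], False]
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = [value, True]   # memory is not updated yet

    def invalidate(self, addr):
        self.lines.pop(addr, None)

    def flush(self, addr):
        line = self.lines.get(addr)
        if line and line[1]:
            self.memory[addr] = line[0]
            line[1] = False


memory = {0x100: 1, 0x200: 5}
cache = WriteBackCache(memory)

# Case 1: DMA writes memory (disk read) while the cache holds the old value.
cache.read(0x100)
memory[0x100] = 99                      # DMA bypasses the cache
stale_read = cache.read(0x100)          # processor still sees the old value
cache.invalidate(0x100)
fresh_read = cache.read(0x100)

# Case 2: DMA reads memory (disk write) before a dirty line is written back.
cache.write(0x200, 7)
stale_dma_value = memory[0x200]         # DMA sees the old value in memory
cache.flush(0x200)
fresh_dma_value = memory[0x200]
```

In the first case the processor reads 1 although memory holds 99; in the second the DMA unit would read 5 although the processor has written 7.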



Hardware Software Interface

In a system with virtual memory, should DMA work with virtual addresses or
physical addresses? The obvious problem with virtual addresses is that the DMA
unit will need to translate the virtual addresses to physical addresses. The major
problem with the use of a physical address in a DMA transfer is that the transfer
cannot easily cross a page boundary. If an I/O request crossed a page boundary,
then the memory locations to which it was being transferred would not
necessarily be contiguous in the virtual memory. Consequently, if we use physical
addresses, we must constrain all DMA transfers to stay within one page.



One method to allow the system to initiate DMA transfers that cross page 
boundaries is to make the DMA work on virtual addresses. In such a system, the 
DMA unit has a small number of map entries that provide virtual-to-physical 
mapping for a transfer. The operating system provides the mapping when the I/O 
is initiated. By using this mapping, the DMA unit need not worry about the loca- 
tion of the virtual pages involved in the transfer. 

Another technique is for the operating system to break the DMA transfer into a 
series of transfers, each confined within a single physical page. The transfers are 
then chained together and handed to an I/O processor or intelligent DMA unit 
that executes the entire sequence of transfers; alternatively, the operating system 
can individually request the transfers. 

Whichever method is used, the operating system must still cooperate by not 
remapping pages while a DMA transfer involving that page is in progress. 
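
The second technique, breaking a transfer into page-confined pieces, is easy to sketch. The code below assumes a 4096-byte page and a hypothetical page table given as a Python dictionary from virtual page number to physical page number; the resulting list plays the role of the chained transfers handed to the DMA unit.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def split_at_page_boundaries(virtual_addr, length, page_size=PAGE_SIZE):
    """Break one transfer into (virtual address, byte count) pieces,
    each confined to a single page."""
    pieces = []
    while length > 0:
        room = page_size - (virtual_addr % page_size)  # bytes left in this page
        n = min(room, length)
        pieces.append((virtual_addr, n))
        virtual_addr += n
        length -= n
    return pieces


def to_physical(pieces, page_table, page_size=PAGE_SIZE):
    """Translate each piece using a virtual-page -> physical-page mapping."""
    return [
        (page_table[va // page_size] * page_size + va % page_size, n)
        for va, n in pieces
    ]


# Hypothetical mapping: virtual pages 2, 3, 4 live in scattered physical pages.
page_table = {2: 9, 3: 1, 4: 6}
pieces = split_at_page_boundaries(virtual_addr=2 * PAGE_SIZE + 4000, length=5000)
chain = to_physical(pieces, page_table)
```

A 5000-byte request starting near the end of virtual page 2 becomes three transfers whose physical addresses are not contiguous, which is why a single physically addressed transfer could not have crossed the page boundaries.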



We have looked at three different methods for transferring data between an I/O 
device and memory. In moving from polling to an interrupt-driven to a DMA 
interface, we shift the burden for managing an I/O operation from the processor 
to a progressively more intelligent I/O controller. These methods have the advan- 
tage of freeing up processor cycles. Their disadvantage is that they increase the 
cost of the I/O system. Because of this, a given computer system can choose which 
point along this spectrum is appropriate for the I/O devices connected to it. 

Before discussing the design of I/O systems, let's look briefly at performance 
measures of them. 



Hardware 
Software 
Interface 



The coherency problem for I/O data is avoided by using one of three major tech- 
niques. One approach is to route the I/O activity through the cache. This ensures 
that reads see the latest value while writes update any data in the cache. Routing all 
I/O through the cache is expensive and potentially has a large negative perfor- 
mance impact on the processor, since the I/O data is rarely used immediately and 
may displace useful data that a running program needs. A second choice is to have 
the OS selectively invalidate the cache for an I/O read or force write-backs to 
occur for an I/O write (often called cache flushing). This approach requires some 
small amount of hardware support and is probably more efficient if the software 
can perform the function easily and efficiently. Because this flushing of large parts 
of the cache need only happen on DMA block accesses, it will be relatively infre- 
quent. The third approach is to provide a hardware mechanism for selectively 
flushing (or invalidating) cache entries. Hardware invalidation to ensure cache 
coherence is typical in multiprocessor systems, and the same technique can be 
used for I/O; we discuss this topic in detail in Chapter 9.

Check Yourself

In ranking the three ways of doing I/O, which statements are true?

1. If we want the lowest latency for an I/O operation to a single I/O device, the
order is polling, DMA, and interrupt driven.

2. In terms of lowest impact on processor utilization from a single I/O device,
the order is DMA, interrupt driven, and polling.



8.6 I/O Performance Measures: Examples from Disk and File Systems



How should we compare I/O systems? This is a complex question because I/O 
performance depends on many aspects of the system and different applications 
stress different aspects of the I/O system. Furthermore, a design can make com- 
plex trade-offs between response time and throughput, making it impossible to 
measure just one aspect in isolation. For example, handling a request as early as 
possible generally minimizes response time, although greater throughput can be 
achieved if we try to handle related requests together. Accordingly, we may 
increase throughput on a disk by grouping requests that access locations that are 
close together. Such a policy will increase the response time for some requests, 
probably leading to a larger variation in response time. Although throughput will 
be higher, some benchmarks constrain the maximum response time to any 
request, making such optimizations potentially problematic. 
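
The trade-off can be made concrete with a small sketch. The Python below is illustrative only: the cylinder numbers, the starting head position, and the helper name `total_seek_distance` are assumptions, and head movement stands in for service time. It compares serving five requests in arrival order with serving them grouped by location.

```python
def total_seek_distance(start, order):
    """Sum of head movements (in cylinders) when serving requests in the given order."""
    position, total = start, 0
    for cylinder in order:
        total += abs(cylinder - position)
        position = cylinder
    return total

# Hypothetical request stream: cylinder numbers in order of arrival.
arrivals = [90, 10, 85, 15, 80]
head_start = 50  # assumed initial head position

in_arrival_order = total_seek_distance(head_start, arrivals)

# Group requests whose locations are close together by sorting them.
grouped = sorted(arrivals)
in_grouped_order = total_seek_distance(head_start, grouped)

# Where in the service order the first-arriving request (cylinder 90) ends up.
first_request_slot = grouped.index(arrivals[0]) + 1

print(in_arrival_order, in_grouped_order, first_request_slot)
```

Under these assumed numbers, grouping cuts total head movement from 330 to 120 cylinders, yet the request that arrived first is now served last, which is the growth in response-time variation described above.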

In this section, we give some examples of measurements proposed for deter- 
mining the performance of disk systems. These benchmarks are affected by a 
variety of system features, including the disk technology, how disks are con- 
nected, the memory system, the processor, and the file system provided by the 
operating system. 

Before we discuss these benchmarks, we need to address a confusing point 
about terminology and units. The performance of I/O systems depends on the 
rate at which the system transfers data. The transfer rate depends on the clock 
rate, which is typically given in GHz = $10^9$ cycles per second. The transfer rate is
usually quoted in GB/sec. In I/O systems, GBs are measured using base 10 (i.e., 1
GB = $10^9$ = 1,000,000,000 bytes), unlike main memory where base 2 is used (i.e., 1
GB = $2^{30}$ = 1,073,741,824). In addition to adding confusion, this difference
introduces the need to convert between base 10 (1K = 1000) and base 2 (1K = 1024)
because many I/O accesses are for data blocks that have a size that is a power of 
two. Rather than complicate all our examples by accurately converting one of the 
two measurements, we make note here of this distinction and the fact that treating 
the two measures as if the units were identical introduces a small error. We illus- 
trate this error in Section 8.9. 

transaction processing A type of application that involves handling small short
operations (called transactions) that typically require both I/O and computation.
Transaction processing applications typically have both response time
requirements and a performance measurement based on the throughput of
transactions.

I/O rate Performance measure of I/Os per unit time, such as reads per second.

data rate Performance measure of bytes per unit time, such as GB/second.



Transaction Processing I/O Benchmarks 

Transaction processing (TP) applications involve both a response time require- 
ment and a performance measurement based on throughput. Furthermore, most 
of the I/O accesses are small. Because of this, TP applications are chiefly con- 
cerned with I/O rate, measured as the number of disk accesses per second, as 
opposed to data rate, measured as bytes of data per second. TP applications gen- 
erally involve changes to a large database, with the system meeting some response 
time requirements as well as gracefully handling certain types of failures. These 
applications are extremely critical and cost-sensitive. For example, banks nor- 
mally use TP systems because they are concerned about a range of characteristics. 
These include making sure transactions aren't lost, handling transactions quickly, 
and minimizing the cost of processing each transaction. Although dependability 
in the face of failure is an absolute requirement in such systems, both response 
time and throughput are critical to building cost-effective systems. 

A number of transaction processing benchmarks have been developed. The 
best-known set of benchmarks is a series developed by the Transaction Processing 
Council (TPC). 

TPC-C, initially created in 1992, simulates a complex query environment. 
TPC-H models ad hoc decision support — the queries are unrelated and knowl- 
edge of past queries cannot be used to optimize future queries; the result is that 
query execution times can be very long. TPC-R simulates a business decision sup- 
port system where users run a standard set of queries. In TPC-R, preknowledge of 
the queries is taken for granted, and the DBMS can be optimized to run these que- 
ries. TPC-W is a Web-based transaction benchmark that simulates the activities of 
a business-oriented transactional Web server. It exercises the database system as 
well as the underlying Web server software. The TPC benchmarks are described at 
www.tpc.org. 

All the TPC benchmarks measure performance in transactions per second. In 
addition, they include a response time requirement, so that throughput perfor- 
mance is measured only when the response time limit is met. To model real-world 
systems, higher transaction rates are also associated with larger systems, both in 
terms of users and the size of the database that the transactions are applied to. 
Finally, the system cost for a benchmark system must also be included, allowing 
accurate comparisons of cost-performance. 



File System and Web I/O Benchmarks 

File systems, which are stored on disks, have a different access pattern. For exam- 
ple, measurements of UNIX file systems in an engineering environment have 
found that 80% of accesses are to files of less than 10 KB and that 90% of all file 
accesses are to data with sequential addresses on the disk. Furthermore, 67% of 
the accesses were reads, 27% were writes, and 6% were read-modify-write 
accesses, which read data, modify it, and then rewrite the same location. Such 
measurements have led to the creation of synthetic file system benchmarks. One
of the most popular of such benchmarks has five phases, using 70 files: 

■ MakeDir: Constructs a directory subtree that is identical in structure to the
given directory subtree 

■ Copy: Copies every file from the source subtree to the target subtree 

■ ScanDir: Recursively traverses a directory subtree and examines the status 
of every file in it 

■ ReadAll: Scans every byte of every file in a subtree once 

■ Make: Compiles and links all the files in a subtree 

As we will see in Section 8.7, the design of an I/O system involves knowing what 
the workload is. 

In addition to processor benchmarks, SPEC offers both a file server benchmark 
(SPECSFS) and a Web server benchmark (SPECWeb). SPECSFS is a benchmark 
for measuring NFS (Network File System) performance using a script of file server 
requests; it tests the performance of the I/O system, including both disk and net- 
work I/O, as well as the processor. SPECSFS is a throughput-oriented benchmark 
but with important response time requirements. SPECWeb is a Web server bench- 
mark that simulates multiple clients requesting both static and dynamic pages 
from a server, as well as clients posting data to the server. 

I/O Performance versus Processor Performance 

Amdahl's law in Chapter 2 reminds us that neglecting I/O is dangerous. A simple 
example demonstrates this. 



Impact of I/O on System Performance

EXAMPLE

Suppose we have a benchmark that executes in 100 seconds of elapsed time,
where 90 seconds is CPU time and the rest is I/O time. If CPU time improves
by 50% per year for the next five years but I/O time doesn't improve, how
much faster will our program run at the end of five years?

ANSWER

We know that

Elapsed time = CPU time + I/O time
100 = 90 + I/O time
I/O time = 10 seconds

The new CPU times and the resulting elapsed times are computed in the
following table:



After n years   CPU time                I/O time     Elapsed time   % I/O time
0               90 seconds              10 seconds   100 seconds    10%
1               90/1.5 = 60 seconds     10 seconds   70 seconds     14%
2               60/1.5 = 40 seconds     10 seconds   50 seconds     20%
3               40/1.5 = 27 seconds     10 seconds   37 seconds     27%
4               27/1.5 = 18 seconds     10 seconds   28 seconds     36%
5               18/1.5 = 12 seconds     10 seconds   22 seconds     45%



The improvement in CPU performance over five years is

$$\frac{90}{12} = 7.5$$

However, the improvement in elapsed time is only

$$\frac{100}{22} = 4.5$$



and the I/O time has increased from 10% to 45% of the elapsed time. 
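
The arithmetic in this example is easy to reproduce. The following Python is a minimal sketch, not part of the original example: the function name `io_share_over_time` and its parameters are assumptions, and CPU time is rounded to whole seconds each year because the example appears to carry rounded values from one row to the next.

```python
def io_share_over_time(cpu_s=90, io_s=10, yearly_speedup=1.5, years=5):
    """Return (year, CPU s, I/O s, elapsed s, % I/O) rows; only CPU time improves."""
    rows = []
    for year in range(years + 1):
        elapsed = cpu_s + io_s
        io_percent = round(100 * io_s / elapsed)
        rows.append((year, cpu_s, io_s, elapsed, io_percent))
        # Assumption: round to whole seconds each year, as the worked table does.
        cpu_s = round(cpu_s / yearly_speedup)
    return rows

rows = io_share_over_time()
for year, cpu_s, io_s, elapsed, io_percent in rows:
    print(f"{year}  CPU {cpu_s:3d} s  I/O {io_s} s  elapsed {elapsed:3d} s  I/O share {io_percent}%")

cpu_speedup = rows[0][1] / rows[-1][1]      # 90 / 12
elapsed_speedup = rows[0][3] / rows[-1][3]  # 100 / 22
print(round(cpu_speedup, 1), round(elapsed_speedup, 1))
```

Changing `io_s` or `yearly_speedup` shows how quickly a fixed I/O time comes to dominate elapsed time as the processor improves.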



Check Yourself



Are the following true or false? Unlike processor benchmarks, I/O benchmarks 

1. concentrate on throughput rather than latency 

2. can require that the data set scale in size or number of users to achieve per- 
formance milestones 

3. come from organizations rather than from individuals 



8.7  Designing an I/O System



There are two primary types of specifications that designers encounter in I/O sys- 
tems: latency constraints and bandwidth constraints. In both cases, knowledge of 
the traffic pattern affects the design and analysis. 

Latency constraints involve ensuring that the latency to complete an I/O opera- 
tion is bounded by a certain amount. In the simple case, the system may be 
unloaded, and the designer must ensure that some latency bound is met either
because it is critical to the application or because the device must receive certain 
guaranteed service to prevent errors. Examples of the latter are similar to the anal- 
ysis we looked at in the previous section. Likewise, determining the latency on an 
unloaded system is relatively easy, since it involves tracing the path of the I/O 
operation and summing the individual latencies. 

Finding the average latency (or distribution of latency) under a load is a much 
more complex problem. Such problems are tackled either by queuing theory 
(when the behavior of the workload requests and I/O service times can be approx- 
imated by simple distributions) or by simulation (when the behavior of I/O events 
is complex). Both topics are beyond the limits of this text. 

Designing an I/O system to meet a set of bandwidth constraints given a work- 
load is the other typical problem designers face. Alternatively, the designer may be 
given a partially configured I/O system and be asked to balance the system to main- 
tain the maximum bandwidth achievable as dictated by the preconfigured portion 
of the system. This latter design problem is a simplified version of the first. 

The general approach to designing such a system is as follows: 

1. Find the weakest link in the I/O system, which is the component in the I/O 
path that will constrain the design. Depending on the workload, this com- 
ponent can be anywhere, including the CPU, the memory system, the back- 
plane bus, the I/O controllers, or the devices. Both the workload and 
configuration limits may dictate where the weakest link is located. 

2. Configure this component to sustain the required bandwidth. 

3. Determine the requirements for the rest of the system and configure them 
to support this bandwidth. 

The easiest way to understand this methodology is with an example. 



I/O System Design

EXAMPLE

Consider the following computer system:

■ A CPU that sustains 3 billion instructions per second and averages
100,000 instructions in the operating system per I/O operation

■ A memory backplane bus capable of sustaining a transfer rate of 1000
MB/sec

■ SCSI Ultra320 controllers with a transfer rate of 320 MB/sec and
accommodating up to 7 disks

■ Disk drives with a read/write bandwidth of 75 MB/sec and an average
seek plus rotational latency of 6 ms

If the workload consists of 64 KB reads (where the block is sequential on a
track) and the user program needs 200,000 instructions per I/O operation,
find the maximum sustainable I/O rate and the number of disks and SCSI
controllers required. Assume that the reads can always be done on an idle
disk if one exists (i.e., ignore disk conflicts).

ANSWER



The two fixed components of the system are the memory bus and the CPU. 
Let's first find the I/O rate that these two components can sustain and deter- 
mine which of these is the bottleneck. Each I/O takes 200,000 user instruc- 
tions and 100,000 OS instructions, so 



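$$\text{Maximum I/O rate of CPU} = \frac{\text{Instruction execution rate}}{\text{Instructions per I/O}} = \frac{3 \times 10^{9}}{(200 + 100) \times 10^{3}} = 10{,}000\ \frac{\text{I/Os}}{\text{second}}$$

Each I/O transfers 64 KB, so

$$\text{Maximum I/O rate of bus} = \frac{\text{Bus bandwidth}}{\text{Bytes per I/O}} = \frac{1000 \times 10^{6}}{64 \times 10^{3}} = 15{,}625\ \frac{\text{I/Os}}{\text{second}}$$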

The CPU is the bottleneck, so we can now configure the rest of the system to 
perform at the level dictated by the CPU, 10,000 I/Os per second. 

Let's determine how many disks we need to be able to accommodate 10,000 
I/Os per second. To find the number of disks, we first find the time per I/O op- 
eration at the disk: 

$$\text{Time per I/O at disk} = \text{Seek + rotational time} + \text{Transfer time} = 6\ \text{ms} + \frac{64\ \text{KB}}{75\ \text{MB/sec}} = 6.9\ \text{ms}$$

Thus, each disk can complete 1000 ms/6.9 ms or 146 I/Os per second. To
saturate the CPU requires 10,000 I/Os per second, or 10,000/146 ≈ 69 disks.

To compute the number of SCSI buses, we need to check the average trans- 
fer rate per disk to see if we can saturate the bus, which is given by 

$$\text{Transfer rate} = \frac{\text{Transfer size}}{\text{Transfer time}} = \frac{64\ \text{KB}}{6.9\ \text{ms}} \approx 9.56\ \text{MB/sec}$$

The maximum number of disks per SCSI bus is 7, which won't saturate this 
bus. This means we will need 69/7, or 10 SCSI buses and controllers. 
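
The three-step method used in this example (find the bottleneck among the fixed components, then size the remaining components to match it) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a general design tool: the function name `design_io_system` and its parameter names are invented here, 64 KB is treated as 64 x 10^3 bytes as in the bus calculation, and no intermediate value is rounded, so some intermediate figures differ slightly from the rounded ones in the text.

```python
import math

def design_io_system(
    cpu_ips=3e9, os_instr_per_io=100_000, user_instr_per_io=200_000,
    bus_bytes_per_s=1000e6, io_bytes=64e3,
    disk_bytes_per_s=75e6, disk_latency_s=6e-3,
    controller_bytes_per_s=320e6, disks_per_controller=7,
):
    # Step 1: find the weakest link among the fixed components.
    cpu_rate = cpu_ips / (os_instr_per_io + user_instr_per_io)
    bus_rate = bus_bytes_per_s / io_bytes
    io_rate = min(cpu_rate, bus_rate)
    bottleneck = "CPU" if cpu_rate <= bus_rate else "bus"

    # Step 2: size the disks to sustain that I/O rate (disk conflicts ignored).
    time_per_io = disk_latency_s + io_bytes / disk_bytes_per_s
    disks = math.ceil(io_rate * time_per_io)

    # Step 3: check whether a full controller would saturate, then count controllers.
    per_disk_bytes_per_s = io_bytes / time_per_io
    controller_saturated = (
        disks_per_controller * per_disk_bytes_per_s > controller_bytes_per_s
    )
    controllers = math.ceil(disks / disks_per_controller)

    return {
        "bottleneck": bottleneck,
        "io_rate": io_rate,
        "disks": disks,
        "controllers": controllers,
        "controller_saturated": controller_saturated,
    }

result = design_io_system()
print(result)
```

With the example's parameters the sketch reports the CPU as the bottleneck at 10,000 I/Os per second, 69 disks, and 10 controllers. It only flags controller saturation; if the flag were true, fewer disks per controller would have to be used.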
Notice the significant number of simplifying assumptions that are needed to do
this example. In practice, many of these simplifications might not hold for critical 
I/O-intensive applications (such as databases). For this reason, simulation is often 
the only realistic way to predict the I/O performance of a realistic workload. 



8.8  Real Stuff: A Digital Camera



Digital cameras are basically embedded computers with removable, writable,
nonvolatile storage and interesting I/O devices. Figure 8.14 shows our example.




FIGURE 8.14 The Sanyo VPC-SX500 with Flash memory card and IBM Microdrive. 

Although newer cameras offer more pixels per picture, the principles are the same. This 1360 x 1024 pixel 
digital camera stores pictures either using CompactFlash memory or using an IBM Microdrive. This photo
was taken using a 340 MB microdrive and an 8 MB CompactFlash memory. As Figure 8.15 shows, in 2004 the
capacities are as large as 1 GB to 4 GB. It is 4.3 inches wide x 2.5 inches high x 1.6 inches deep, and it weighs
7.4 ounces. In addition to taking a still picture and converting it to JPEG format every 0.9 seconds, it can
record a QuickTime video clip at VGA size (640 x 480). One technological advantage is the use of a custom
system on a chip to reduce size and power, so the camera only needs two AA batteries to operate versus four 
in other digital cameras. 
When powered on, the microprocessor first runs diagnostics on all components
and writes any error messages to the liquid crystal display (LCD) on the
back of the camera. This camera uses a 1.8-inch low-temperature polysilicon TFT
color LCD. When photographers take pictures, they first hold the shutter halfway
so that the microprocessor can take a light reading. The microprocessor then
keeps the shutter open to get the necessary light, which is captured by a
charge-coupled device (CCD) as red, green, and blue pixels.

For the camera in Figure 8.14, the CCD is a 1/2-inch, 1360 x 1024 pixel, pro- 
gressive-scan chip. The pixels are scanned out row by row and then passed 
through routines for white balance, color, and aliasing correction, and then stored 
in a 4 MB frame buffer. The next step is to compress the image into a standard for- 
mat, such as JPEG, and store it in the removable Flash memory. The photographer 
picks the compression, in this camera called either fine or normal, with a com- 
pression ratio of 10 to 20 times. A fine-quality compressed image takes less than 
0.5 MB, and a normal-quality compressed image takes about 0.25 MB. The micro- 
processor then updates the LCD display to show that there is room for one less 
picture. 
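
The "room for N more pictures" counter amounts to a simple division. The sketch below is hypothetical: the function name `pictures_that_fit` is invented, the image sizes are the approximate figures quoted above (0.5 MB fine, 0.25 MB normal), the capacities are the 8 MB CompactFlash card and 340 MB microdrive named in the Figure 8.14 caption, and file system overhead is ignored.

```python
def pictures_that_fit(capacity_mb, image_mb):
    """Whole images of a given size that fit in the given capacity (no overhead)."""
    return int(capacity_mb // image_mb)

# Approximate per-image sizes quoted in the text.
IMAGE_MB = {"fine": 0.5, "normal": 0.25}
# Capacities named in the Figure 8.14 caption.
CAPACITY_MB = {"8 MB CompactFlash": 8, "340 MB microdrive": 340}

counts = {
    (medium, quality): pictures_that_fit(capacity, size)
    for medium, capacity in CAPACITY_MB.items()
    for quality, size in IMAGE_MB.items()
}

for (medium, quality), n in counts.items():
    print(f"{medium}, {quality}: about {n} pictures")
```

Because a fine-quality image takes less than 0.5 MB, the fine-quality counts are lower bounds under these assumptions.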

Although the previous paragraph covers the basics of a digital camera, there are 
many more features that are included: showing the recorded images on the color 
LCD display; sleep mode to save battery life; monitoring battery energy; buffering 
to allow recording a rapid sequence of uncompressed images; and, in this camera, 
video recording using MPEG format and audio recording using WAV format. 

This camera allows the photographer to use a Microdrive disk instead of Com- 
pactFlash memory. Figure 8.15 compares CompactFlash and the IBM Microdrive. 



Characteristics                          Sandisk Type I    Sandisk Type II   Hitachi 4 GB
                                         CompactFlash      CompactFlash      Microdrive
                                         SDCFB-128-768     SDCFB-1000-768    DSCM-10340

Formatted data capacity (MB)             128               1000              4000
Bytes per sector                         512               512               512
Data transfer rate (MB/sec)              4 (burst)         4 (burst)         4-7
Link speed to buffer (MB/sec)            6                 6                 33
Power standby/operating (W)              0.15/0.66         0.15/0.66         0.07/0.83
Size: height x width x depth (inches)    1.43x1.68x0.13    1.43x1.68x0.13    1.43x1.68x0.16
Weight in grams (454 grams/pound)        11.4              13.5              16
Write cycles before sector wear-out      300,000           300,000           not applicable
Mean time between failures (hours)       > 1,000,000       > 1,000,000       (see caption)
Best price (2004)                        $40               $200              $480



FIGURE 8.15 Characteristics of three storage alternatives for digital cameras. Hitachi 
matches the Type II form factor in the Microdrive, while the CompactFlash card uses that space to include 
many more Flash chips. Hitachi does not quote MTTF for the 1.0-inch drives, but the service life is five 
years or 8800 powered-on hours, whichever is first. They rotate at 3600 RPM and have 12 ms seek times. 
The CompactFlash standard package was proposed by Sandisk Corporation in
1994 for the PCMCIA-ATA cards of portable PCs. Because it follows the ATA 
interface, it simulates a disk interface including seek commands, logical tracks, 
and so on. It includes a built-in controller to support many types of Flash memory 
and to help with chip yield for Flash memories by mapping out bad blocks. 

The electronic brain of this camera is an embedded computer with several spe- 
cial functions embedded on the chip. Figure 8.16 shows the block diagram of a 
chip similar to the one in the camera. Such chips have been called systems on a 
chip (SOC) because they essentially integrate into a single chip all the parts that 
were found on a small printed circuit board of the past. SOC generally reduces 
size and lowers power compared to less integrated solutions. The manufacturer 
claims the SOC enables the camera to operate on half the number of batteries and 
to offer a smaller form factor than competitors' cameras. 




[Block diagram. Labeled blocks: SDRAM controller, signal processor, MJPEG,
NTSC/PAL encoder, 2-channel video D/A, bus bridge, SSFDC controller, RISC,
audio D/A and A/D, DRAM controller, UART x2, IrDA, PCMCIA controller,
SIO/PIO/PWM, and DMA controller. Buses: signal bus and CPU bus. External
connections: SDRAM, Smart Media, Flash (program), DRAM, RS-232, IrDA port,
PCMCIA card, LCD/TV, MIC, speaker, and others.]



FIGURE 8.16 The system on a chip (SOC) found in Sanyo digital cameras. This block dia- 
gram is for the predecessor of the SOC in the camera in Figure 8.14. The successor SOC, called Super 
Advanced IC, uses three buses instead of two, operates at 60 MHz, consumes 800 mW, and fits 3.1M transis- 
tors in a 10.2 x 10.2 mm die using a 0.35-micron process. Note that this embedded system has twice as 
many transistors as the state-of-the-art, high-performance microprocessor in 1990! The SOC in the figure is 
limited to processing 1024 x 768 pixels, but its successor supports 1360 x 1024 pixels. (See Okada, Matsuda, 
Yamada, and Kobayashi [1999].)
For higher performance, it has two buses. The 16-bit bus is for the many slower
I/O devices: Smart Media interface, program and data memory, and DMA. The
32-bit bus is for the SDRAM, the signal processor (which is connected to the
CCD), the Motion JPEG encoder, and the NTSC/PAL encoder (which is connected
to the LCD). Note the large variety of I/O buses that this chip must integrate,
unlike desktop microprocessors. The 32-bit RISC MPU is a proprietary design
and runs at 28.8 MHz, the same clock rate as the buses. This 700 mW chip
contains 1.8M transistors in a 10.5 x 10.5 mm die implemented using a 0.35-micron
process.



8.9  Fallacies and Pitfalls



Fallacy: The rated mean time to failure of disks is 1,200,000 hours or almost 140 
years, so disks practically never fail. 

The current marketing practices of disk manufacturers can mislead users. How is 
such an MTTF calculated? Early in the process manufacturers will put thousands 
of disks in a room, run them for a few months, and count the number that fail. 
They compute MTTF as the total number of hours that the disks were cumula- 
tively up divided by the number that failed. 

One problem is that this number far exceeds the lifetime of a disk, which is 
commonly assumed to be five years or 43,800 hours. For this large MTTF to make 
some sense, disk manufacturers argue that the calculation corresponds to a user 
who buys a disk, and then keeps replacing the disk every five years — the planned 
lifetime of the disk. The claim is that if many customers (and their great- 
grandchildren) did this for the next century, on average they would replace a disk 
27 times before a failure, or about 140 years. 

A more useful measure would be percentage of disks that fail. Assume 1000 
disks with a 1,200,000-hour MTTF and that the disks are used 24 hours a day. If 
you replaced failed disks with a new one having the same reliability characteristics, 
the number that would fail over five years (43,800 hours) is 

$$\text{Failed disks} = \frac{1000\ \text{drives} \times 43{,}800\ \text{hours/drive}}{1{,}200{,}000\ \text{hours/failure}} = 36$$

Stated alternatively, 3.6% would fail over the 5-year period.
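
The same calculation can be checked with a short Python sketch. The function name `expected_failures` is an assumption for illustration, and the model is the one the text uses: a constant failure rate of 1/MTTF per disk-hour, with failed disks replaced by equivalent ones. The unrounded result is 36.5 disks, or 3.65%, which the text reports as 36 and 3.6%.

```python
HOURS_PER_YEAR = 24 * 365  # disks assumed to run 24 hours a day

def expected_failures(n_disks, hours_in_service, mttf_hours):
    """Expected failures, assuming a constant rate of 1/MTTF per disk-hour."""
    return n_disks * hours_in_service / mttf_hours

mttf_hours = 1_200_000
mttf_years = mttf_hours / HOURS_PER_YEAR   # the "almost 140 years" figure
lifetime_hours = 5 * HOURS_PER_YEAR        # 43,800 hours

failures = expected_failures(1000, lifetime_hours, mttf_hours)
failure_fraction = failures / 1000

print(f"MTTF of {mttf_years:.0f} years, yet {failures} of 1000 disks "
      f"({failure_fraction:.2%}) are expected to fail within five years")
```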

Pitfall: Using the peak transfer rate of a portion of the I/O system to make perfor- 
mance projections or performance comparisons. 

Many of the components of an I/O system, from the devices to the controllers to 
the buses, are specified using their peak bandwidths. In practice, these peak band- 
width measurements are often based on unrealistic assumptions about the system 
or are unattainable because of other system limitations. For example, in quoting
bus performance, the peak transfer rate is sometimes specified using a memory 
system that is impossible to build. For networked systems, the software overhead 
of initiating communication is ignored. 

The 32-bit, 33 MHz PCI bus has a peak bandwidth of about 133 MB/sec. In 
practice, even for long transfers, it is difficult to sustain more than about 80 
MB/sec for realistic memory systems. As mentioned above, users of wireless net- 
works typically achieve only about a third of the peak bandwidth. 
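
To see where a peak figure such as the PCI number comes from, the sketch below multiplies bus width by clock rate and compares the result with the sustained rate quoted above. It rests on assumptions not stated in the text: one data transfer per clock cycle, and a nominal clock of 33.33 MHz for the "33 MHz" bus. The function name `peak_bandwidth_mb_per_s` is invented for illustration.

```python
def peak_bandwidth_mb_per_s(bus_width_bits, clock_mhz):
    """Peak rate assuming one full-width transfer on every clock cycle."""
    return bus_width_bits / 8 * clock_mhz

# Assumption: the nominal clock of the "33 MHz" bus is 33.33 MHz.
peak = peak_bandwidth_mb_per_s(32, 33.33)

sustained = 80  # MB/sec, the realistic figure quoted in the text
sustained_fraction = sustained / peak

print(f"peak of about {peak:.0f} MB/sec, sustained {sustained_fraction:.0%} of peak")
```

Even for this simple bus, a projection based on the peak number would overstate the achievable transfer rate by roughly two-thirds.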

Amdahl's law also reminds us that the throughput of an I/O system will be lim- 
ited by the lowest-performance component in the I/O path. 

Fallacy: Magnetic disk storage is on its last legs and will be replaced shortly. 

This is both a fallacy and a pitfall. Such claims have been made constantly for the 
past 20 years, though the string of failed alternatives in recent years seems to have 
reduced the level of claims for the death of magnetic storage. Among the unsuc- 
cessful contenders are magnetic bubble memories, optical storage, and holo- 
graphic storage. None of these systems has matched the combination of 
characteristics that favor magnetic disks: high reliability, nonvolatility, low cost, 
reasonable access time, and rapid improvement. Magnetic storage technology 
continues to improve at the same — or faster — pace that it has sustained over the 
past 25 years. 

Pitfall: Using magnetic tapes to back up disks. 

Once again, this is both a fallacy and a pitfall. 

Magnetic tapes have been part of computer systems as long as disks because
they use technology similar to that of disks, and hence historically have
followed the same density improvements. The historic cost-performance
difference between disks and tapes rests on two facts. A sealed, rotating disk
has lower access time than sequential tape access. Removable spools of
magnetic tape, however, mean that many tapes can be used per reader, and
because tapes can be very long, each has high capacity. Hence, in the
past a single magnetic tape could hold the contents of many disks, and since it was
10 to 100 times cheaper per gigabyte than disks, it was a useful backup medium.

The claim was that magnetic tapes must track disks since innovations in disks 
must help tapes. This claim was important because tapes were a small market and 
could not afford a separate large research and development effort. One reason the 
market is small is that desktop owners generally do not back up disks onto tape, 
and so while desktops are by far the largest market for disks, desktops are a small 
market for tapes. 

Alas, the larger market has led disks to improve much more quickly than tapes. 
Starting in 2000 to 2002, the largest popular disk was larger than the largest popu- 
lar tape. In that same time frame, the price per gigabyte of ATA disks dropped 
below that of tapes. Tape apologists now claim that tapes have compatibility 
requirements that are not imposed on disks; tape readers must read or write the 
current and previous generation of tapes, and must read the last four generations 
of tapes. As disks are closed systems, disk heads need only read the platters
enclosed with them, and this advantage explains why disks are improving much 
more rapidly. 

Today, some organizations have dropped tapes altogether, using networks and 
remote disks to replicate the data geographically. The sites are picked so that disas- 
ters would not take out both sites, enabling instantaneous recovery time. (Long 
recovery time is another serious drawback to the serial nature of magnetic tapes.) 
Such a solution depends on advances in disk capacity and network bandwidth to 
make economic sense, but these two are getting much greater investment and 
hence have better recent records of accomplishment than tape. 

Fallacy: A 100 MB/sec bus can transfer 100 MB of data in 1 second. 

First, you generally cannot use 100% of any computer resource. For a bus, you 
would be fortunate to get 70% to 80% of the peak bandwidth. Time to send the 
address, time to acknowledge the signals, and stalls while waiting to use a busy bus 
are among the reasons you cannot use 100% of a bus. 

Second, the definition of a megabyte of storage and a megabyte per second of 
bandwidth do not agree. As we discussed on page 597, I/O bandwidth measures 
are usually quoted in base 10 (i.e., 1 MB/sec = $10^6$ bytes/sec), while 1 MB of data
is typically a base 2 measure (i.e., 1 MB = $2^{20}$ bytes). How significant is this
distinction? If we could use 100% of the bus for data transfer, the time to transfer 100
MB of data on a 100-MB/sec bus is actually

$$\frac{100 \times 2^{20}}{100 \times 10^{6}} = \frac{1{,}048{,}576}{1{,}000{,}000} = 1.048576 \approx 1.05\ \text{second}$$

A similar but larger error is introduced when we treat a gigabyte of data
transferred or stored as equivalent, meaning $10^9$ versus $2^{30}$ bytes.
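
Both effects in this fallacy can be combined in one short calculation. The Python below is a sketch with assumed names (`transfer_time_s`, `usable_fraction`); the 75% utilization is simply a point inside the 70% to 80% range mentioned above, not a measured value.

```python
MB_OF_DATA = 2**20   # bytes in a megabyte of stored data (base 2)
MB_PER_SEC = 10**6   # bytes/sec in a megabyte per second of bandwidth (base 10)

def transfer_time_s(data_bytes, bandwidth_bytes_per_s, usable_fraction=1.0):
    """Transfer time when only a fraction of the peak bandwidth is usable."""
    return data_bytes / (bandwidth_bytes_per_s * usable_fraction)

# Units mismatch alone, at an unrealistic 100% utilization.
ideal = transfer_time_s(100 * MB_OF_DATA, 100 * MB_PER_SEC)

# Assumed 75% utilization, inside the 70% to 80% range given in the text.
realistic = transfer_time_s(100 * MB_OF_DATA, 100 * MB_PER_SEC, usable_fraction=0.75)

# The corresponding mismatch for gigabytes.
gb_ratio = 2**30 / 10**9

print(f"{ideal:.6f} s ideal, {realistic:.2f} s at 75% utilization, "
      f"GB mismatch factor {gb_ratio:.3f}")
```

The units mismatch costs about 5% at the megabyte scale and about 7% at the gigabyte scale, while limited bus utilization is the much larger effect.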

Pitfall: Trying to provide features only within the network versus end to end. 

The concern is providing at a lower level features that can only be accomplished at 
the highest level, thus only partially satisfying the communication demand. 
Saltzer, Reed, and Clark [1984] give the end-to-end argument as

The function in question can completely and correctly be specified only with the
knowledge and help of the application standing at the endpoints of the commu- 
nication system. Therefore, providing that questioned function as a feature of 
the communication system itself is not possible. 

Their example of the pitfall was a network at MIT that used several gateways, each 
of which added a checksum from one gateway to the next. The programmers of 
the application assumed the checksum guaranteed accuracy, incorrectly believing 
that the message was protected while stored in the memory of each gateway. One 
gateway developed a transient failure that swapped one pair of bytes per million 
bytes transferred. Over time the source code of one operating system was
repeatedly passed through the gateway, thereby corrupting the code. The only solution
was to correct the infected source files by comparing to paper listings and repair- 
ing the code by hand! Had the checksums been calculated and checked by the 
application running on the end systems, safety would have been assured. 

There is a useful role for intermediate checks, however, provided that end-to- 
end checking is available. End-to-end checking may show that something is broken 
between two nodes, but it doesn't point to where the problem is. Intermediate 
checks can discover what is broken. You need both for repair. 

Pitfall: Moving functions from the CPU to the I/O processor, expecting to improve 
performance without a careful analysis. 

There are many examples of this pitfall trapping people, although I/O processors, 
when properly used, can certainly enhance performance. A frequent instance of 
this pitfall is the use of intelligent I/O interfaces, which, because of the higher
overhead to set up an I/O request, can turn out to have worse latency than a pro- 
cessor-directed I/O activity (although if the processor is freed up sufficiently, sys- 
tem throughput may still increase). Frequently, performance falls when the I/O 
processor has much lower performance than the main processor. Consequently, a 
small amount of main processor time is replaced with a larger amount of I/O pro- 
cessor time. Workstation designers have seen both these phenomena repeatedly. 

Myer and Sutherland [1968] wrote a classic paper on the trade-off of complexity
and performance in I/O controllers. Borrowing the religious concept of the
"wheel of reincarnation," they eventually noticed they were caught in a loop of 
continuously increasing the power of an I/O processor until it needed its own sim- 
pler coprocessor: 

We approached the task by starting with a simple scheme and then adding
commands and features that we felt would enhance the power of the machine.
Gradually the [display] processor became more complex. . . . Finally the display
processor came to resemble a full-fledged computer with some special graphics
features. And then a strange thing happened. We felt compelled to add to the
processor a second, subsidiary processor, which, itself, began to grow in
complexity. It was then that we discovered the disturbing truth. Designing a display
processor can become a never-ending cyclical process. In fact, we found the
process so frustrating that we have come to call it the "wheel of reincarnation."

I/O systems are evaluated on several different characteristics: dependability; the
variety of I/O devices supported; the maximum number of I/O devices; cost; and
performance, measured both in latency and in throughput. These goals lead to
widely varying schemes for interfacing I/O devices. In the low-end and midrange 
systems, buffered DMA is likely to be the dominant transfer mechanism. In the 
high-end systems, latency and bandwidth may both be important, and cost may 
be secondary. Multiple paths to I/O devices with limited buffering often charac- 
terize high-end I/O systems. Typically, being able to access the data on an I/O 
device at any time (high availability) becomes more important as systems grow. As 
a result, redundancy and error correction mechanisms become more and more 
prevalent as we enlarge the system. 

Storage and networking demands are growing at unprecedented rates, in part 
because of increasing demands for all information to be at your fingertips. One 
estimate is that the amount of information created in 2002 was 5 exabytes — 
equivalent to 500,000 copies of the text in the U.S. Library of Congress — and that 
the total amount of information in the world doubled in the last three years 
[Lyman and Varian 2003].

Future directions of I/O include expanding the reach of wired and wireless net- 
works, with nearly every device potentially having an IP address, and the continu- 
ing transformation from parallel buses to serial networks and switches. However, 
consolidation in the disk industry may slow the improvement in disk capacity back
to earlier rates; capacity doubled every year between 2000 and 2004.



Understanding Program Performance



The performance of an I/O system, whether measured by bandwidth or latency, 
depends on all the elements in the path between the device and memory, includ- 
ing the operating system that generates the I/O commands. The bandwidth of the 
buses, the memory, and the device determine the maximum transfer rate from or 
to the device. Similarly, the latency depends on the device latency, together with 
any latency imposed by the memory system or buses. The effective bandwidth and 
response latency also depend on other I/O requests that may cause contention for 
some resource in the path. Finally, the operating system is a bottleneck. In some 
cases, the OS takes a long time to deliver an I/O request from a user program to an 
I/O device, leading to high latency. In other cases, the operating system effectively 
limits the I/O bandwidth because of limitations in the number of concurrent I/O 
operations it can support. 
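
As a rough illustration of the paragraph above, the following C sketch models the I/O path as a chain of components: the maximum transfer rate is bounded by the slowest component, and the unloaded latency is the sum of the component latencies. The struct and function names are hypothetical, and contention and operating system overhead are ignored unless they are added as extra entries in the array.

```c
#include <assert.h>
#include <stddef.h>

/* One element on the path between device and memory (hypothetical model). */
struct path_element {
    double bandwidth_mb_per_s; /* peak transfer rate of this element */
    double latency_ms;         /* delay this element adds to a request */
};

/* The slowest element bounds the maximum transfer rate of the whole path. */
double path_bandwidth(const struct path_element *p, size_t n) {
    assert(n > 0);
    double min_bw = p[0].bandwidth_mb_per_s;
    for (size_t i = 1; i < n; i++)
        if (p[i].bandwidth_mb_per_s < min_bw)
            min_bw = p[i].bandwidth_mb_per_s;
    return min_bw;
}

/* Without contention, the latencies along the path simply add up. */
double path_latency(const struct path_element *p, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += p[i].latency_ms;
    return total;
}
```

Calling `path_bandwidth` on an array describing a disk, an I/O bus, and a memory system returns the rate of whichever of the three is slowest.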

Keep in mind that while performance can help sell an I/O system, users over- 
whelmingly demand dependability and capacity from their I/O systems. 

Historical Perspective and Further Reading

The history of I/O systems is a fascinating one. Section 8.11 gives a brief
history of magnetic disks, RAID, databases, the Internet, the World Wide Web, 
and how Ethernet continues to triumph over its challengers. 

8.1 [10] <§§8.1-8.2> Here are two different I/O systems intended for use in
transaction processing: 

■ System A can support 1500 I/O operations per second. 

■ System B can support 1000 I/O operations per second. 

The systems use the same processor that executes 500 million instructions per sec- 
ond. Assume that each transaction requires 5 I/O operations and that each I/O 
operation requires 10,000 instructions. Ignoring response time and assuming that 
transactions may be arbitrarily overlapped, what is the maximum transaction- 
per-second rate that each machine can sustain? 

8.2 [15] <§§8.1-8.2> The latency of an I/O operation for the two systems in Exercise
8.1 differs. The latency for an I/O on system A is equal to 20 ms, while for system
B the latency is 18 ms for the first 500 I/Os per second and 25 ms per I/O for
each I/O between 500 and 1000 I/Os per second. In the workload, every 10th trans- 
action depends on the immediately preceding transaction and must wait for its 
completion. What is the maximum transaction rate that still allows every transac- 
tion to complete in 1 second and that does not exceed the I/O bandwidth of the 
machine? (For simplicity, assume that all transaction requests arrive at the begin- 
ning of a 1-second interval.) 

8.3 [5] <§§8.1-8.2> Suppose we want to use a laptop to send 100 files of approx- 
imately 40 MB each to another computer over a 5 Mbit/sec wireless connection. 
The laptop battery currently holds 100,000 joules of energy. The wireless networking
card alone consumes 5 watts while transmitting, while the rest of the laptop
always consumes 35 watts. Before each file transfer we need 10 seconds to choose 
which file to send. How many complete files can we transfer before the laptop's 
battery runs down to zero? 

8.4 [10] <§§8.1-8.2> Consider the laptop's hard disk power consumption in
Exercise 8.3. Assume that it is no longer constant, but varies between 6 watts when 
it is spinning and 1 watt when it is not spinning. The power consumed by the lap- 
top apart from the hard disk and wireless card is a constant 32 watts. Suppose that 
the hard disk's transfer rate is 50 MB/sec, its delay before it can begin transfer is 20 
ms, and at all other times it does not spin. How many complete files can we transfer 
before the laptop's battery runs down to zero? How much energy would we need 
to send all 100 files? (Consider that the wireless card cannot send data until it is in 
memory.) 



8.5 [5] <§8.3> The following simplified diagram shows two potential ways of 
numbering the sectors of data on a disk (only two tracks are shown and each track 
has eight sectors). Assuming that typical reads are contiguous (e.g., all 16 sectors 
are read in order), which way of numbering the sectors will be likely to result in 
higher performance? Why? 









8.6 [20] <§8.3> In this exercise, we will run a program to evaluate the behavior of 
a disk drive. Disk sectors are addressed sequentially within a track, tracks sequen- 
tially within cylinders, and cylinders sequentially within the disk. Determining 
head switch time and cylinder switch time is difficult because of rotational effects. 
Even determining platter count, sectors/track, and rotational delay is difficult 
based on observation of typical disk workloads. 

The key is to factor out disk rotational effects by making consecutive seeks to indi- 
vidual sectors with addresses that differ by a linearly increasing amount starting 
with 0, 1, 2, and so forth. The Skippy algorithm, from work by Nisha Talagala and 
colleagues of U.C. Berkeley [2000], is

fd = open("raw disk device");
for (i = 0; i < measurements; i++) {
    // time the following sequence, and output <i, time>
    lseek(fd, i * SINGLE_SECTOR, SEEK_CUR);
    write(fd, buffer, SINGLE_SECTOR);
}
close(fd);

The basic algorithm skips through the disk, increasing the distance of the seek by 
one sector before every write, and outputs the distance and time for each write. 
The raw device interface is used to avoid file system optimizations. SINGLE_SECTOR
is the size of a single sector in bytes. The SEEK_CUR argument to lseek
moves the file pointer an amount relative to the current pointer. A technical
report describing Skippy and two other disk drive benchmarks (run in seconds or
minutes rather than hours or days) is at
http://sunsite.berkeley.edu/Dienst/UI/2.0/Describe/ncstrl.ucb/CSD-99-1063.
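
One possible way to turn the listing above into compilable C is sketched below. It is not the benchmark authors' code: the 512-byte sector size, the function name `skippy`, the use of `timespec_get` for timing, and the error handling are all assumptions made for illustration. Writing to a raw disk device overwrites its contents, so a sketch like this should only ever be pointed at a scratch disk; run against an ordinary file, it exercises the loop but the timings say nothing about disk geometry.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define SINGLE_SECTOR 512 /* assumed sector size in bytes */

/* Elapsed time between two timestamps, in microseconds. */
static double elapsed_us(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

/*
 * Runs the Skippy loop on device_path, printing <i, time> for each step and
 * storing the time of step i in times_us[i].
 * Returns 0 on success and -1 on any I/O error.
 */
int skippy(const char *device_path, int measurements, double *times_us) {
    static char buffer[SINGLE_SECTOR]; /* zero-filled data to write */
    assert(measurements >= 0);

    int fd = open(device_path, O_WRONLY);
    if (fd < 0)
        return -1;

    for (int i = 0; i < measurements; i++) {
        struct timespec start, end;
        timespec_get(&start, TIME_UTC);
        /* Skip i sectors forward from the current position, then write one. */
        if (lseek(fd, (off_t)i * SINGLE_SECTOR, SEEK_CUR) == (off_t)-1 ||
            write(fd, buffer, SINGLE_SECTOR) != SINGLE_SECTOR) {
            close(fd);
            return -1;
        }
        timespec_get(&end, TIME_UTC);
        times_us[i] = elapsed_us(start, end);
        printf("%d %.1f\n", i, times_us[i]);
    }
    return close(fd);
}
```

Plotting the printed pairs with distance on the horizontal axis gives a graph of the kind the following exercises ask you to interpret.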

Run the Skippy algorithm on a disk drive of your choosing. 

a. What is the number of heads? 

b. The number of platters? 

c. What is the rotational latency? 

d. What is the head switch time (the time to switch the head that is reading 
from one disk surface to another without moving the arm; that is, in the 
same cylinder)? 

e. What is the cylinder switch time? (It is the time to move the arm to the next 
sequential cylinder.) 

8.7 [20] <§8.3> Figure 8.17 shows the output from running the benchmark 
Skippy on a disk. 

a. What is the number of heads? 

b. The number of platters? 

c. What is the rotational latency? 

d. What is the head switch time (the time to switch the head that is reading 
from one disk surface to another without moving the arm; that is, in the 
same cylinder)? 

e. What is the cylinder switch time (the time to move the arm to the next 
sequential cylinder)? 

FIGURE 8.17 Example output of Skippy for a hypothetical disk. (Plot not reproduced; the horizontal axis is Distance (sectors).)



8.8 [10] <§8.3> Consider two RAID disk systems that are meant to store 10 terabytes
of data (not counting any redundancy). System A uses RAID 1 technology,
and System B uses RAID 5 technology with four disks in a "protection group." 

a. How many more terabytes of storage are needed in System A than in System 
B? 

b. Suppose an application writes one block of data to the disk. If reading or 
writing a block takes 30 ms, how much time will the write take on System A 
in the worst case? How about on System B in the worst case? 

c. Is System A more reliable than System B? Why or why not?

8.9 [15] <§8.3> What can happen to a RAID 5 system if the power fails between
the write update to the data block and the write update to the check block so that 
only one of the two is successfully written? What could be done to prevent this 
from happening? 

8.10 [5] <§8.3> The speed of light is approximately 3 × 10^8 meters per second,
and electrical signals travel at about 50% of this speed in a conductor. When the 
term high speed is applied to a network, it is the bandwidth that is higher, not necessarily
the velocity of the electrical signals. How much of a factor is the actual
"flight time" for the electrical signals? Consider two computers that are 20 meters 
apart and two computers that are 2000 kilometers apart. Compare your results to 
the latencies reported in the example on page 8.3-7 in Section 8.3.

8.11 [5] <§8.3> The number of bytes in transit on a network is defined as the
flight time (described in Exercise 8.10) multiplied by the delivered bandwidth. Cal- 
culate the number of bytes in transit for the two networks described in Exercise 
8.10, assuming a delivered bandwidth of 6 MB/sec. 

8.12 [5] <§8.3> A secret agency simultaneously monitors 100 cellular phone con- 
versations and multiplexes the data onto a network with a bandwidth of 5 MB/sec 
and an overhead latency of 150 µs per 1 KB message. Calculate the transmission
time per message and determine whether there is sufficient bandwidth to support 
this application. Assume that the phone conversation data consists of 2 bytes sam- 
pled at a rate of 4 KHz. 

8.13 [5] <§8.3> Wireless networking has a much higher bit error rate (BER) than 
wired networking. One way to cope with a higher BER is to use an error correcting 
code (ECC) on the transmitted data. A very simple ECC is to triplicate each bit, 
encoding each zero as 000 and each one as 111. When an encoded 3-bit pattern is 
received, the system chooses the most likely original bit. 

a. If the system received 001, what is the most likely value of the original bit? 

b. If 000 was sent but a double-bit error causes it to be received as 110, what 
will the receiver believe was the original bit's value? 

c. How many bit errors can this simple ECC correct? 

d. How many bit errors can this ECC detect? 

e. If 1 out of every 100 bits sent over the network is incorrect, what percentage 
of bit errors would a receiver using this ECC not detect? 
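
The triplication scheme described in this exercise can be sketched in a few lines of C. This is only an illustration of the encoding and of majority-vote decoding; the function names are hypothetical, and the sketch deliberately leaves the exercise's questions about error counts to the reader.

```c
#include <assert.h>

/* Encode one bit (0 or 1) as three identical copies: 0 -> 000, 1 -> 111. */
unsigned encode_bit(unsigned bit) {
    assert(bit <= 1u);
    return bit ? 0x7u : 0x0u;
}

/* Decode a received 3-bit pattern by taking the majority of its three bits. */
unsigned decode_bits(unsigned received) {
    assert(received <= 0x7u);
    unsigned ones = (received & 1u) + ((received >> 1) & 1u) + ((received >> 2) & 1u);
    return ones >= 2u ? 1u : 0u;
}
```

Flipping bits of `encode_bit(b)` with an XOR mask before passing the result to `decode_bits` is a simple way to experiment with how the decoder behaves under different error patterns.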

8.14 [5] <§8.3> There are two types of parity: even and odd. A binary word with
even parity and no errors will have an even number of 1s in it, while a word with
odd parity and no errors will have an odd number of 1s in it. Compute the parity
bit for each of the following 8-bit words if even parity is used: 

a. 01100111 

b. 01010101 
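
For readers who want to check parity calculations mechanically, here is a minimal C sketch of even parity for 8-bit words. The function names are hypothetical, and the convention assumed is the one stated in the exercise: the word together with its parity bit must contain an even number of 1s.

```c
#include <assert.h>
#include <stdint.h>

/* Count the 1 bits in an 8-bit word. */
static unsigned count_ones(uint8_t word) {
    unsigned ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (word >> i) & 1u;
    return ones;
}

/* Parity bit that makes the total number of 1s (word plus parity bit) even. */
unsigned even_parity_bit(uint8_t word) {
    return count_ones(word) & 1u;
}

/* Returns 1 if the word and its stored parity bit satisfy even parity. */
int even_parity_ok(uint8_t word, unsigned parity_bit) {
    assert(parity_bit <= 1u);
    return ((count_ones(word) + parity_bit) & 1u) == 0u;
}
```

A check such as `even_parity_ok` can report that some odd number of bits changed, but it carries no information about which bit position was affected.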

8.15 [10] <§8.3>

a. If a system uses even parity, and the word 0111 is read from the disk, can we
tell if there is a single-bit error? 

b. If a system uses odd parity, and the word 0101 appears on the processor-memory
bus, we suspect that a single-bit error has occurred. Can we tell
which bit the error occurs in? Why or why not? 

c. If a system uses even parity and the word 0101 appears on the processor- 
memory bus, can we tell if there is a double-bit error? 

8.16 1 1 1 <§8.3> A program repeatedly performs a three-step process: It reads in 
a 4 KB block of data from disk, does some processing on that data, and then writes 
out the result as another 4 KB block elsewhere on the disk. Each block is contiguous 
and randomly located on a single track on the disk. The disk drive rotates at 10,000 
RPM, has an average seek time of 8 ms, and has a transfer rate of 50 MB/sec. The 
controller overhead is 2 ms. No other program is using the disk or processor, and 
there is no overlapping of disk operation with processing. The processing step 
takes 20 million clock cycles, and the clock rate is 5 GHz. What is the overall speed 
of the system in blocks processed per second? 

8.17 [5] <§8.4> The OSI network protocol is a hierarchy of layers of abstraction,
creating an interface between network applications and the physical wires. This is 
similar to the levels of abstraction used in the ISA interface between software and 
hardware. Name three advantages to using abstraction in network protocol design. 

8.18 [5] <§§8.3, 8.5> Suppose we have a system with the following characteris- 
tics: 

1. A memory and bus system supporting block access of 4 to 16 32-bit words. 

2. A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer 
taking 1 clock cycle, and 1 clock cycle required to send an address to mem- 
ory. 

3. Two clock cycles needed between each bus operation. (Assume the bus is 
idle before an access.) 

4. A memory access time for the first four words of 200 ns; each additional set 
of four words can be read in 20 ns. 

Assume that the bus and memory systems described above are used to handle disk 
accesses from disks like the one described in the example on page 570. If the I/O is 
allowed to consume 100% of the bus and memory bandwidth, what is the maxi- 
mum number of simultaneous disk transfers that can be sustained for the two 
block sizes? 

8.19 [5] <§8.5> In the system described in Exercise 8.18, the memory system
took 200 ns to read the first four words, and each additional four words required 
20 ns. Assuming that the memory system takes 150 ns to read the first four words 
and 30 ns to read each additional four words, find the sustained bandwidth and the 
latency for a read of 256 words for transfers that use 4-word blocks and for transfers
that use 16-word blocks. Also compute the effective number of bus transactions
per second for each case.

8.20 [5] <§8.5> Exercise 8.19 demonstrates that using larger block sizes results in 
an increase in the maximum sustained bandwidth that can be achieved. Under 
what conditions might a designer tend to favor smaller block sizes? Specifically, 
why would a designer choose a block size of 4 instead of 16 (assuming all of the 
characteristics are as identified in Exercise 8.19)? 

8.21 [15] <§8.5> This question examines in more detail how increasing the block
size for bus transactions decreases the total latency required and increases the max- 
imum sustainable bandwidth. In Exercise 8.19, two different block sizes are con- 
sidered (4 words and 16 words). Compute the total latency and the maximum 
bandwidth for all of the possible block sizes (between 4 and 16) and plot your 
results. Summarize what you learn by looking at your graph. 

8.22 [15] <§8.5> This exercise is similar to Exercise 8.21. This time fix the block 
size at 4 and 16 (as in Exercise 8.19), but compute latencies and bandwidths for 
reads of different sizes. Specifically, consider reads of from 4 to 256 words, and use 
as many data points as you need to construct a meaningful graph. Use your graph 
to help determine at what point block sizes of 16 result in a reduced latency when 
compared with block sizes of 4. 

8.23 [10] <§8.5> This exercise examines a design alternative to the system
described in Exercise 8.18 that may improve the performance of writes. For writes, 
assume all of the characteristics reported in Exercise 8.18 as well as the following: 

The first 4 words are written 200 ns after the address is available, and each 
new write takes 20 ns. Assume a bus transfer of the most recent data to 
write, and a write of the previous 4 words can be overlapped. 

The performance analysis reported in the example would thus remain unchanged 
for writes (in actuality, some minor changes might exist due to the need to com- 
pute error correction codes, etc., but we'll ignore this). An alternative bus scheme 
relies on separate 32-bit address and data lines. This will permit an address and 
data to be transmitted in the same cycle. For this bus alternative, what will the 
latency of the entire 256-word transfer be? What is the sustained bandwidth? Con- 
sider block sizes of 4 and 8 words. When do you think the alternative scheme 
would be heavily favored? 

8.24 [20] <§8.5> Consider an asynchronous bus used to interface an I/O
device to the memory system described in Exercise 8.18. Each I/O request asks 
for 16 words of data from the memory, which, along with the I/O device, has a 
4-word bus. Assume the same type of handshaking protocol as appears in Figure 
8.10 on page 584 except that it is extended so that the memory can continue the 
transaction by sending additional blocks of data until the transaction is complete.
Modify Figure 8.10 (both the steps and diagram) to indicate how such a
transfer might take place. Assuming that each handshaking step takes 20 ns and 
memory access takes 60 ns, how long does it take to complete a transfer? What is 
the maximum sustained bandwidth for this asynchronous bus, and how does it 
compare to the synchronous bus in the example? 

8.25 [1 day-1 week] <§§8.2-8.5> For More Practice: Writing Code to Benchmark I/O Performance

8.26 [3 days-1 week] <§§8.3-8.5> In More Depth: Ethernet Simulation

8.27 [15] <§8.5> We want to compare the maximum bandwidth for a synchro- 
nous and an asynchronous bus. The synchronous bus has a clock cycle time of 
50 ns, and each bus transmission takes 1 clock cycle. The asynchronous bus requires 
40 ns per handshake. The data portion of both buses is 32 bits wide. Find the band- 
width for each bus when performing one-word reads from a 200-ns memory. 

8.28 [20] <§8.5> Suppose we have a system with the following characteristics: 

1. A memory and bus system supporting block access of 4 to 16 32-bit words. 

2. A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer 
taking 1 clock cycle, and 1 clock cycle required to send an address to mem- 
ory. 

3. Two clock cycles needed between each bus operation. (Assume the bus is 
idle before an access.) 

4. A memory access time for the first four words of 200 ns; each additional set 
of four words can be read in 20 ns. Assume that a bus transfer of the most 
recently read data and a read of the next four words can be overlapped. 

Find the sustained bandwidth and the latency for a read of 256 words for transfers 
that use 4-word blocks and for transfers that use 16-word blocks. Also compute 
the effective number of bus transactions per second for each case. Recall that a 
single bus transaction consists of an address transmission followed by data. 

8.29 [10] <§8.5> Let's determine the impact of polling overhead for three differ- 
ent devices. Assume that the number of clock cycles for a polling operation — 
including transferring to the polling routine, accessing the device, and restarting 
the user program — is 400 and that the processor executes with a 500-MHz clock. 

Determine the fraction of CPU time consumed for the following three cases, 
assuming that you poll often enough so that no data is ever lost and assuming that 
the devices are potentially always busy: 

1. The mouse must be polled 30 times per second to ensure that we do not 
miss any movement made by the user. 

2. The floppy disk transfers data to the processor in 16-bit units and has a data
rate of 50 KB/sec. No data transfer can be missed. 

3. The hard disk transfers data in four-word chunks and can transfer at 
4 MB/sec. Again, no transfer can be missed. 
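
The reasoning behind this exercise reduces to two small formulas, sketched here in C under stated assumptions: a device that moves a fixed number of bytes per poll must be polled at its data rate divided by that unit size, and the processor fraction consumed is the polling rate times the cycles per poll, divided by the clock rate. The function names are hypothetical.

```c
#include <assert.h>

/* Minimum polls per second so that no data is lost: one poll per unit moved. */
double required_polls_per_second(double bytes_per_second, double bytes_per_poll) {
    assert(bytes_per_poll > 0.0);
    return bytes_per_second / bytes_per_poll;
}

/* Fraction of processor time consumed by polling at the given rate. */
double polling_fraction(double polls_per_second, double cycles_per_poll,
                        double clock_hz) {
    assert(clock_hz > 0.0);
    return polls_per_second * cycles_per_poll / clock_hz;
}
```

Whether "KB" and "MB" are read as powers of ten or powers of two changes the results slightly, so state the convention you choose when reporting an answer.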

8.30 [15] <§§8.3-8.6> For the I/O system described in Exercise 8.45, find the
maximum instantaneous bandwidth at which data can be transferred from disk to 
memory using as many disks as needed. How many disks and I/O buses (the min- 
imum of each) do you need to achieve the bandwidth? Since you need only achieve 
this bandwidth for an instant, latencies need not be considered. 

8.31 [20] <§§8.3-8.6> In More Depth: Disk Arrays versus Single Disk

8.32 [10] <§§8.3-8.6> In More Depth: Disk Arrays Bandwidth

8.33 [5] <§8.6> Suppose you are designing a microprocessor that uses special
instructions to access I/O devices (instead of mapping the devices to memory 
addresses). What special instructions would you need to include? What additional 
bus lines would you need this microprocessor to support in order to address I/O 
devices? 

8.34 <§8.6> An important advantage of interrupts over polling is the ability of 
the processor to perform other tasks while waiting for communication from an I/O 
device. Suppose that a 1 GHz processor needs to read 1000 bytes of data from a particular
I/O device. The I/O device supplies 1 byte of data every 0.02 ms. The code
to process the data and store it in a buffer takes 1000 cycles. 

a. If the processor detects that a byte of data is ready through polling, and a 
polling iteration takes 60 cycles, how many cycles does the entire operation 
take? 

b. If instead, the processor is interrupted when a byte is ready, and the proces- 
sor spends the time between interrupts on another task, how many cycles of 
this other task can the processor complete while the I/O communication is 
taking place? The overhead for handling an interrupt is 200 cycles. 

8.35 [20] <§§8.3-8.6> For More Practice: Finding I/O Bandwidth Bottlenecks

8.36 [15] <§§8.3-8.6> For More Practice: Finding I/O Bandwidth Bottlenecks

8.37 [15] <§§7.3, 7.5, 8.5, 8.6> For More Practice: I/O System Operation

8.38 [10] <§8.6> Write a paragraph identifying some of the simplifying assumptions
made in the analysis below:

Suppose we have a processor that executes with a 500-MHz clock and the number 
of clock cycles for a polling operation, including transferring to the polling routine,
accessing the device, and restarting the user program, is 400. The hard disk
transfers data in four-word chunks and can transfer at 4 MB/sec. Assume that you
poll often enough that no data is ever lost and assume that the hard disk is poten- 
tially always busy. The initial setup of a DMA transfer takes 1000 clock cycles for 
the processor, and the handling of the interrupt at DMA completion requires 500 
clock cycles for the processor. The hard disk has a transfer rate of 4 MB/sec and 
uses DMA. Ignore any impact from bus contention between the processor and the 
DMA controller. Therefore, if the average transfer from the disk is 8 KB, the frac- 
tion of the 500-MHz processor consumed if the disk is actively transferring 100% 
of the time is 0.2%. 
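
The arithmetic behind the 0.2% figure in the analysis above can be reproduced with a short C sketch. It assumes decimal units (4 MB = 4 × 10^6 bytes and 8 KB = 8 × 10^3 bytes), which gives 500 transfers per second; the function name and parameter names are hypothetical.

```c
#include <assert.h>

/*
 * Fraction of the processor consumed by DMA bookkeeping when the disk
 * transfers continuously: each transfer costs one setup and one interrupt.
 */
double dma_fraction(double clock_hz, double setup_cycles, double interrupt_cycles,
                    double disk_bytes_per_second, double bytes_per_transfer) {
    assert(clock_hz > 0.0 && bytes_per_transfer > 0.0);
    double transfers_per_second = disk_bytes_per_second / bytes_per_transfer;
    double cycles_per_second = transfers_per_second * (setup_cycles + interrupt_cycles);
    return cycles_per_second / clock_hz;
}
```

With the numbers from the analysis, `dma_fraction(500e6, 1000, 500, 4e6, 8e3)` evaluates to 500 × 1500 / (500 × 10^6) = 0.0015, or 0.15%, which the analysis reports rounded to 0.2%.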

8.39 [8] <§8.6> Suppose we have the same hard disk and processor we used in
Exercise 8.18, but we use interrupt-driven I/O. The overhead for each transfer, 
including the interrupt, is 500 clock cycles. Find the fraction of the processor con- 
sumed if the hard disk is only transferring data 5% of the time. 

8.40 [8] <§8.6> Suppose we have the same processor and hard disk as in Exercise 
8.18. Assume that the initial setup of a DMA transfer takes 1000 clock cycles for the 
processor, and assume the handling of the interrupt at DMA completion requires 
500 clock cycles for the processor. The hard disk has a transfer rate of 4 MB/sec and 
uses DMA. If the average transfer from the disk is 8 KB, what fraction of the 500- 
MHz processor is consumed if the disk is actively transferring 100% of the time? 
Ignore any impact from bus contention between the processor and DMA control- 
ler. 

8.41 [2 days-1 week] <§8.6, Appendix A> For More Practice: Using SPIM to
Explore I/O

8.42 [3 days-1 week] <§8.6, Appendix A> For More Practice: Writing Code
to Perform I/O

8.43 [3 days-1 week] <§8.6, Appendix A> For More Practice: Writing Code
to Perform I/O

8.44 1 1 3 1 <§§8.3-8.7> Redo the example on page 601, but instead assume that
the reads are random 8-KB reads. You can assume that the reads are always to an 
idle disk, if one is available. 

8.45 [20] <§§8.3-8.7> Here are a variety of building blocks used in an I/O system 
that has a synchronous processor-memory bus running at 800 MHz and one or 
more I/O adapters that interface I/O buses to the processor-memory bus. 

■ Memory system: The memory system has a 32-bit interface and handles 
four-word transfers. The memory system has separate address and data lines 
and, for writes to memory, accepts a word every clock cycle for 4 clock cycles 
and then takes an additional 4 clock cycles before the words have been 
stored and it can accept another transaction. 

■ DMA interfaces: The I/O adapters use DMA to transfer the data between the
I/O buses and the processor-memory bus. The DMA unit arbitrates for the 
processor-memory bus and sends/receives four-word blocks from/to the 
memory system. The DMA controller can accommodate up to eight disks. 
Initiating a new I/O operation (including the seek and access) takes 0.1 ms, 
during which another I/O cannot be initiated by this controller (but out- 
standing operations can be handled). 

■ I/O bus: The I/O bus is a synchronous bus with a sustainable bandwidth of 
100 MB/sec; each transfer is one word long. 

■ Disks: The disks have a measured average seek plus rotational latency of 8 
ms. The disks have a read/write bandwidth of 40 MB/sec, when they are 
transferring. 

Find the time required to read a 16 KB sector from a disk to memory, assuming 
that this is the only activity on the bus. 



8.46 [5] <§8.7> In order to perform a disk or network access, it is typically nec- 
essary for the user to have the operating system communicate with the disk or net- 
work controllers. Suppose that in a particular 5 GHz computer, it takes 10,000 
cycles to trap to the OS, 20 ms for the OS to perform a disk access, and 25 µs for
the OS to perform a network access. In a disk access, what percentage of the delay 
time is spent in trapping to the OS? How about in a network access? 



8.47 [5] <§8.7> Suppose that in the computer in Exercise 8.46 we can somehow 
reduce the time for the OS to communicate with the disk controller by 60%, and 
we can reduce the time for the OS to communicate with the network by 40%. By 
what percentage can we reduce the total time for a network access? By what per- 
centage can we reduce the total time for a disk access? Is it worthwhile for us to 
spend a lot of effort improving the OS trap latency in a computer that performs 
many disk accesses? How about in a computer that performs many network 
accesses? 



§8.2, Page 580: Dependability: 2 and 3. RAID: All are true. 
§8.3, Page 8.3-10: 1. 
§8.4, Page 587: 1 and 2. 
§8.5, Page 597: 1 and 2. 
§8.6, Page 600: 1 and 2. 

Saving Lives through Better Diagnosis



Problem: Find a way to examine internal 
organs to diagnose physiological problems
without the use of invasive surgery or harmful 
radiation. 

Solution: The development of magnetic res- 
onance imaging (MRI), a three-dimensional 
scanning technology, has been one of the most 
important breakthroughs in modern medical 
technology. MRI uses a combination of radio- 
frequency pulses and magnetic fields to scan 
tissue. The organ to be imaged is scanned in a 
series of two-dimensional slices, which are 
then composed to create a three-dimensional 
image. 

In addition to this computationally inten- 
sive task of composing the slices to create a 
volumetric image, extensive computation is 
used to extract the initial two-dimensional 
images, since the signal-to-noise ratio is often
low. The development of MRI has allowed the
scanning of soft tissues, such as the brain, for 
which X-rays are not as effective and explor- 
atory surgery is dangerous. Without a cost- 
effective computing capability, MRI would 
remain slow and expensive. 

The two illustrations show a series of MRI
images of the human brain; the images below 
represent two-dimensional slices, while those on 
the facing page show a three-dimensional recon- 
struction. Once an image is in digital form, a 
physician can manipulate the image, removing 
outer layers, examining the image from different 
viewpoints, or looking at the three-dimensional 
structure to help in diagnosis. 

The major benefits of MRI are twofold: 

■ It can reduce the need for unnecessary exploratory surgery. A physician may be
able to determine whether a patient experiencing headaches has a brain tumor,
which requires surgery, or simply needs medication for a headache.

■ By providing a surgeon with an accurate three-dimensional image, MRI can improve
the surgical planning process and hence the outcome. For example, in operating
on the brain to remove a tumor without accurate images of the tumor, the
surgeon likely would have to enter the brain and then create a plan on the fly
depending on the size and exact placement of the tumor. Furthermore, minimally
invasive techniques (e.g., endoscopic surgery), which have become quite effective,
would be impossible without accurate images.

MRI images of a human brain, in two-dimensional view

There are many interesting new uses of MRI technology, which rely on faster and
more cost-effective computing. Some of the most promising are

■ Real-time imaging of the heart and blood vessels to enhance diagnosis of cardiac
and cardiovascular disease.

■ Combining real-time images and MRI images during surgery to help surgeons
accurately perform surgery, particularly when using minimally invasive techniques.

■ Functional MRI (FMRI): a new type of application that uses MRI to examine
brain function, primarily by analyzing blood flow in various portions of the
brain. FMRI is being used for a number of applications, including exploring the
physiological bases for cognitive problems such as dyslexia, pain management,
planning for neurosurgery, and understanding neurological disorders.

To learn more, see these references in the library:



MRI scans from the National Institutes of Health's Visi- 
ble Human project 

Principles of MRI and its application to medical imag- 
ing (long and reasonably detailed, but only a little 
mathematics) 

Using MRI to do real-time cardiac imaging and angiog- 
raphy (imaging of blood vessels) 

Functional MRI, www.fmri.org/fmri.htm

Visualization and imaging (including MRI and CT 
images): high-performance computing for complex 
images 




MRI images of a human brain in three dimensions 

that combines local behavior of a particular branch and global information about
the behavior of some recent number of executed branches.

CPU execution time Also called CPU time. The actual time the CPU spends
computing for a specific task.

crossbar network A network that allows any node to communicate with any other
node in one pass through the network.

patterning steps that can result in the failure of the die containing that defect.

delayed branch A type of branch where the instruction immediately following the
branch is always executed, independent of whether the branch condition is true
or false.

desktop computer A computer designed for use by an individual, usually
incorporating a graphics display, keyboard, and mouse.
die The individual rectangular sections that are cut from a wafer, more informally known as chips.

DIMM (dual inline memory module) A small board that contains DRAM chips on both sides. SIMMs have DRAMs on only one side. Both DIMMs and SIMMs are meant to be plugged into memory slots, usually on a motherboard.

direct memory access (DMA) A mechanism that provides a device controller the ability to transfer data directly to or from the memory without involving the processor.

direct-mapped cache A cache structure in which each memory location is mapped to exactly one location in the cache.

directory A repository for information on the state of every block in main memory, including which caches have copies of the block, whether it is dirty, and so on. Used for cache coherence.

dispatch An operation in a microprogrammed control unit in which the next microinstruction is selected on the basis of one or more fields of a macroinstruction, usually by creating a table containing the addresses of the target microinstructions and indexing the table using a field of the macroinstruction. The dispatch tables are typically implemented in ROM or programmable logic array (PLA). The term dispatch is also used in dynamically scheduled processors to refer to the process of sending an instruction to a queue.

distributed memory Physical memory that is divided into modules, with some placed near each processor in a multiprocessor.

distributed shared memory (DSM) A memory scheme that uses addresses to access remote data when demanded rather than retrieving the data in case it might be used.

dividend A number being divided.

divisor A number that the dividend is divided by.

don't-care term An element of a logical function in which the output does not depend on the values of all the inputs. Don't-care terms may be specified in different ways.

double precision A floating-point value represented in two 32-bit words.

dynamic branch prediction Prediction of branches at runtime using runtime information.

dynamic multiple issue An approach to implementing a multiple-issue processor where many decisions are made during execution by the processor.

dynamic pipeline scheduling Hardware support for reordering the order of instruction execution so as to avoid stalls.

dynamic random access memory (DRAM) Memory built as an integrated circuit; it provides random access to any location.

edge-triggered clocking A clocking scheme in which all state changes occur on a clock edge.

embedded computer A computer inside another device used for running one predetermined application or collection of software.

error-detecting code A code that enables the detection of an error in data, but not the precise location, and hence correction, of the error.

Ethernet A computer network whose length is limited to about a kilometer. Originally capable of transferring up to 10 million bits per second, newer versions can run up to 100 million bits per second and even 1000 million bits per second. It treats the wire like a bus with multiple masters and uses collision detection and a back-off scheme for handling simultaneous accesses.

exception Also called interrupt. An unscheduled event that disrupts program execution; used to detect overflow.
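The direct-mapped cache entry above says that each memory location maps to exactly one cache location. One common way to realize that mapping is to take the block address modulo the number of blocks in the cache. The sketch below illustrates this under stated assumptions: the cache size NUM_BLOCKS and the helper name cache_index are hypothetical and chosen only for the example.

```python
# Illustrative sketch of block placement in a direct-mapped cache.
# NUM_BLOCKS and cache_index are assumptions made for this example,
# not names taken from the glossary.

NUM_BLOCKS = 8  # hypothetical cache size, in blocks


def cache_index(block_address: int, num_blocks: int = NUM_BLOCKS) -> int:
    """Return the single cache location that a memory block can occupy."""
    return block_address % num_blocks


if __name__ == "__main__":
    # Block addresses 1, 9, 17, and 25 all compete for the same location.
    for addr in (1, 9, 17, 25):
        print(addr, "->", cache_index(addr))
```

Because every block has only one legal location, no replacement policy is needed: a newly fetched block simply displaces whatever occupies its index.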
exception enable Also called interrupt enable. A signal or action that controls whether the process responds to an exception or not; necessary for preventing the occurrence of exceptions during intervals before the processor has safely saved the state needed to restart.

executable file A functional program in the format of an object file that contains no unresolved references, relocation information, symbol table, or debugging information.

exponent In the numerical representation system of floating-point arithmetic, the value that is placed in the exponent field.

external label Also called global label. A label referring to an object that can be referenced from files other than the one in which it is defined.

false sharing A sharing situation in which two unrelated shared variables are located in the same cache block and the full block is exchanged between processors even though the processors are accessing different variables.

field programmable devices (FPD) An integrated circuit containing combinational logic, and possibly memory devices, that is configurable by the end user.

field programmable gate array A configurable integrated circuit containing both combinational logic blocks and flip-flops.

finite state machine A sequential logic function consisting of a set of inputs and outputs, a next-state function that maps the current state and the inputs to a new state, and an output function that maps the current state and possibly the inputs to a set of asserted outputs.

firmware Microcode implemented in a memory structure, typically ROM or RAM.

flat panel display, liquid crystal display A display technology using a thin layer of liquid polymers that can be used to transmit or block light according to whether a charge is applied.

flip-flop A memory element for which the output is equal to the value of the stored state inside the element and for which the internal state is changed only on a clock edge.

floating point Computer arithmetic that represents numbers in which the binary point is not fixed.

floppy disk A portable form of secondary memory composed of a rotating mylar platter coated with a magnetic recording material.

flush (instructions) To discard instructions in a pipeline, usually due to an unexpected event.

formal parameter A variable that is the argument to a procedure or macro; replaced by that argument once the macro is expanded.

forward reference A label that is used before it is defined.

forwarding Also called bypassing. A method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory.

fraction The value, generally between 0 and 1, placed in the fraction field.

frame pointer A value denoting the location of the saved registers and local variables for a given procedure.

fully associative cache A cache structure in which a block can be placed in any location in the cache.

fully connected network A network that connects processor-memory nodes by supplying a dedicated communication link between every node.

gate A device that implements basic logic functions, such as AND or OR.

general-purpose register (GPR) A register that can be used for addresses or for data with virtually any instruction.

global miss rate The fraction of references that miss in all levels of a multilevel cache.
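The finite state machine entry above describes two functions: a next-state function over the current state and inputs, and an output function over the current state. A minimal sketch of that structure follows. The two-state controller, its state names, its single input, and its output names are all invented for illustration.

```python
# Sketch of a finite state machine with a next-state function and an
# output function. The states, the input, and the outputs are
# hypothetical; they only illustrate the structure in the definition.

# Next-state function: (current state, input) -> new state.
NEXT_STATE = {
    ("NS_GREEN", False): "NS_GREEN",
    ("NS_GREEN", True): "EW_GREEN",
    ("EW_GREEN", False): "EW_GREEN",
    ("EW_GREEN", True): "NS_GREEN",
}

# Output function: current state -> set of asserted outputs.
OUTPUTS = {
    "NS_GREEN": {"NSlite"},
    "EW_GREEN": {"EWlite"},
}


def next_state(state: str, car_waiting: bool) -> str:
    """Map the current state and the input to the new state."""
    return NEXT_STATE[(state, car_waiting)]


def output(state: str) -> set:
    """Map the current state to the outputs asserted in that state."""
    return OUTPUTS[state]


def run(inputs, state="NS_GREEN"):
    """Step the machine once per input; return final state and outputs seen."""
    trace = []
    for car_waiting in inputs:
        trace.append(output(state))
        state = next_state(state, car_waiting)
    return state, trace
```

In this sketch the outputs depend only on the current state. The definition also allows the output function to depend on the inputs, which would mean keying OUTPUTS on (state, input) pairs instead.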
global pointer The register that is reserved to point to static data.

guard The first of two extra bits kept on the right during intermediate calculations of floating-point numbers; used to improve rounding accuracy.

handler Name of a software routine invoked to "handle" an exception or interrupt.

handshaking protocol A series of steps used to coordinate asynchronous bus transfers in which the sender and receiver proceed to the next step only when both parties agree that the current step has been completed.

hardware description language A programming language for describing hardware used for generating simulations of a hardware design and also as input to synthesis tools that can generate actual hardware.

hardware synthesis tools Computer-aided design software that can generate a gate-level design based on behavioral descriptions of a digital system.

hardwired control An implementation of finite state machine control typically using programmable logic arrays (PLAs) or collections of PLAs and random logic.

hexadecimal Numbers in base 16.

high-level programming language A portable language such as C, Fortran, or Java composed of words and algebraic notation that can be translated by a compiler into assembly language.

hit rate The fraction of memory accesses found in a cache.

hit time The time required to access a level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.

hold time The minimum time during which the input must be valid after the clock edge.

hot swapping Replacing a hardware component while the system is running.

I/O instructions A dedicated instruction that is used to give a command to an I/O device and that specifies both the device number and the command word (or the location of the command word in memory).

I/O rate Performance measure of I/Os per unit time, such as reads per second.

I/O requests Reads or writes to I/O devices.

implementation Hardware that obeys the architecture abstraction.

imprecise interrupt Also called imprecise exception. Interrupts or exceptions in pipelined computers that are not associated with the exact instruction that was the cause of the interrupt or exception.

in-order commit A commit in which the results of pipelined execution are written to the programmer-visible state in the same order that instructions are fetched.

input device A mechanism through which the computer is fed information, such as the keyboard or mouse.

instruction format A form of representation of an instruction composed of fields of binary numbers.

instruction group In IA-64, a sequence of consecutive instructions with no register data dependences among them.

instruction latency The inherent execution time for an instruction.

instruction mix A measure of the dynamic frequency of instructions across one or many programs.

instruction set architecture Also called architecture. An abstract interface between the hardware and the lowest level software of a machine that encompasses all the information necessary to write a machine language program that will run correctly, including instructions, registers, memory access, I/O, and so on.

instruction set The vocabulary of commands understood by a given architecture.
instruction-level parallelism The parallelism among instructions.

integrated circuit Also called chip. A device combining dozens to millions of transistors.

interrupt An exception that comes from outside of the processor. (Some architectures use the term interrupt for all exceptions.)

interrupt-driven I/O An I/O scheme that employs interrupts to indicate to the processor that an I/O device needs attention.

interrupt handler A piece of code that is run as a result of an exception or an interrupt.

issue packet The set of instructions that issues together in 1 clock cycle; the packet may be determined statically by the compiler or dynamically by the processor.

issue slots The positions from which instructions could issue in a given clock cycle; by analogy these correspond to positions at the starting blocks for a sprint.

Java bytecode Instruction from an instruction set designed to interpret Java programs.

Java Virtual Machine (JVM) The program that interprets Java bytecodes.

jump address table Also called jump table. A table of addresses of alternative instruction sequences.

jump-and-link instruction An instruction that jumps to an address and simultaneously saves the address of the following instruction in a register ($ra in MIPS).

Just In Time Compiler (JIT) The name commonly given to a compiler that operates at runtime, translating the interpreted code segments into the native code of the computer.

kernel mode Also called supervisor mode. A mode indicating that a running process is an operating system process.

latch A memory element in which the output is equal to the value of the stored state inside the element and the state is changed whenever the appropriate inputs change and the clock is asserted.

latency (pipeline) The number of stages in a pipeline or the number of stages between two instructions during execution.

least recently used (LRU) A replacement scheme in which the block replaced is the one that has been unused for the longest time.

least significant bit The rightmost bit in a MIPS word.

level-sensitive clocking A timing methodology in which state changes occur at either high or low clock levels but are not instantaneous, as such changes are in edge-triggered designs.

linker Also called link editor. A systems program that combines independently assembled machine language programs and resolves all undefined labels into an executable file.

loader A systems program that places an object program in main memory so that it is ready to execute.

load-store machine Also called register-register machine. An instruction set architecture in which all operations are between registers and data memory may only be accessed via loads or stores.

load-use data hazard A specific form of data hazard in which the data requested by a load instruction has not yet become available when it is requested.

local area network (LAN) A network designed to carry data within a geographically confined area, typically within a single building.

local label A label referring to an object that can be used only within the file in which it is defined.

local miss rate The fraction of references to one level of a cache that miss; used in multilevel hierarchies.

lock A synchronization device that allows access to data to only one processor at a time.
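The least recently used (LRU) entry above names a replacement scheme without showing how the "unused for the longest time" bookkeeping might work. The sketch below models one cache set in software. The class name LRUSet and its interface are assumptions for illustration, and hardware implementations typically approximate LRU rather than track exact order.

```python
from collections import OrderedDict


class LRUSet:
    """Software model of one cache set with LRU replacement.

    Hypothetical helper for illustration only: blocks are kept in order
    of use, oldest first, so the block to replace is always at the front.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # tag -> True, least recently used first

    def access(self, tag):
        """Reference a block; return the tag evicted to make room, or None."""
        if tag in self.blocks:
            # Hit: the block becomes the most recently used.
            self.blocks.move_to_end(tag)
            return None
        evicted = None
        if len(self.blocks) >= self.capacity:
            # Miss in a full set: replace the block unused for the longest time.
            evicted, _ = self.blocks.popitem(last=False)
        self.blocks[tag] = True
        return evicted
```

For a two-block set, the reference sequence A, B, A, C evicts B on the final access, because A was touched more recently than B.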
lookup tables (LUTs) In a field programmable device, the name given to the cells because they consist of a small amount of logic and RAM.

loop unrolling A technique to get more performance from loops that access arrays, in which multiple copies of the loop body are made and instructions from different iterations are scheduled together.

machine language Binary representation used for communication within a computer system.

macro A pattern-matching and replacement facility that provides a simple mechanism to name a frequently used sequence of instructions.

magnetic disk (also called hard disk) A form of nonvolatile secondary memory composed of rotating platters coated with a magnetic recording material.

megabyte Traditionally 1,048,576 (2^20) bytes, although some communications and secondary storage systems have redefined it to mean 1,000,000 (10^6) bytes.

memory The storage area in which programs are kept when they are running and that contains the data needed by the running programs.

memory hierarchy A structure that uses multiple levels of memories; as the distance from the CPU increases, the size of the memories and the access time both increase.

memory-mapped I/O An I/O scheme in which portions of address space are assigned to I/O devices and reads and writes to those addresses are interpreted as commands to the I/O device.

MESI cache coherency protocol A write-invalidate protocol whose name is an acronym for the four states of the protocol: Modified, Exclusive, Shared, Invalid.

message passing Communicating between multiple processors by explicitly sending and receiving information.

metastability A situation that occurs if a signal is sampled when it is not stable for the required set-up and hold times, possibly causing the sampled value to fall in the indeterminate region between a high and low value.

microarchitecture The organization of the processor, including the major functional units, their interconnection, and control.

microcode The set of microinstructions that control a processor.

microinstruction A representation of control using low-level instructions, each of which asserts a set of control signals that are active on a given clock cycle as well as specifies what microinstruction to execute next.

micro-operations The RISC-like instructions directly executed by the hardware in recent Pentium implementations.

microprogram A symbolic representation of control in the form of instructions, called microinstructions, that are executed on a simple micromachine.

microprogrammed control A method of specifying control that uses microcode rather than a finite state representation.

million instructions per second (MIPS) A measurement of program execution speed based on the number of millions of instructions. MIPS is computed as the instruction count divided by the product of the execution time and 10^6.

minterms Also called product terms. A set of logic inputs joined by conjunction (AND operations); the product terms form the first logic stage of the programmable logic array (PLA).

mirroring Writing the identical data to multiple disks to increase data availability.
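The million instructions per second (MIPS) entry above gives a formula in words: instruction count divided by the product of the execution time and one million. A small worked sketch follows. The function name mips_rating and the sample figures are assumptions for illustration, not measurements.

```python
def mips_rating(instruction_count: int, execution_time_seconds: float) -> float:
    """Compute MIPS = instruction count / (execution time x 10^6).

    Hypothetical helper; execution time is assumed to be in seconds.
    """
    if execution_time_seconds <= 0:
        raise ValueError("execution time must be positive")
    return instruction_count / (execution_time_seconds * 1e6)


if __name__ == "__main__":
    # Made-up example: 200 million instructions executed in 4 seconds.
    print(mips_rating(200_000_000, 4.0))
```

Because the rating depends on the instruction count, two programs or instruction sets with different instruction counts for the same task can yield different MIPS values on the same hardware, even if their execution times are equal.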
miss penalty The time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to the other, and insert it in the level that experienced the miss.

miss rate The fraction of memory accesses not found in a level of the memory hierarchy.

most significant bit The leftmost bit in a MIPS word.

motherboard A plastic board containing packages of integrated circuits or chips, including processor, cache, memory, and connectors for I/O devices such as networks and disks.

multicomputer Parallel processors with multiple private addresses.

multicycle implementation Also called multiple clock cycle implementation. An implementation in which an instruction is executed in multiple clock cycles.

multilevel cache A memory hierarchy with multiple levels of caches, rather than just a cache and main memory.

multiple issue A scheme whereby multiple instructions are launched in 1 clock cycle.

multiprocessor Parallel processors with a single shared address.

multistage network A network that supplies a small switch at each node.

NAND gate An inverted AND gate.

network bandwidth Informally, the peak transfer rate of a network; can refer to the speed of a single link or the collective transfer rate of all links in the network.

next-state function A combinational function that, given the inputs and the current state, determines the next state of a finite state machine.

nonblocking assignment An assignment that continues after evaluating the right-hand side, assigning the left-hand side the value only after all right-hand sides are evaluated.

nonblocking cache A cache that allows the processor to make references to the cache while the cache is handling an earlier miss.

nonuniform memory access (NUMA) A type of single-address space multiprocessor in which some memory accesses are faster than others depending on which processor asks for which word.

nonvolatile memory A form of memory that retains data even in the absence of a power source and that is used to store programs between runs. Magnetic disk is nonvolatile and DRAM is not.

nonvolatile Storage device where data retains its value even when power is removed.

nop An instruction that does no operation to change state.

NOR A logical bit-by-bit operation with two operands that calculates the NOT of the OR of the two operands.

NOR gate An inverted OR gate.

normalized A number in floating-point notation that has no leading 0s.

NOT A logical bit-by-bit operation with one operand that inverts the bits; that is, it replaces every 1 with a 0, and every 0 with a 1.

object-oriented language A programming language that is oriented around objects rather than actions, or data versus logic.

opcode The field that denotes the operation and format of an instruction.

operating system Supervising program that manages the resources of a computer for the benefit of the programs that run on that machine.

out-of-order execution A situation in pipelined execution when an instruction blocked from executing does not cause the following instructions to wait.

output device A mechanism that conveys the result of a computation to a user or another computer.
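The NOR and NOT entries above describe bit-by-bit operations on words. The sketch below models them on 32-bit values, matching the MIPS word size. The helper names and the mask constant are assumptions for illustration. Masking is needed because Python integers are not fixed width.

```python
WORD_MASK = 0xFFFFFFFF  # keep results within a 32-bit word (assumed width)


def bitwise_not(a: int) -> int:
    """Invert every bit of a 32-bit word: each 1 becomes 0 and each 0 becomes 1."""
    return ~a & WORD_MASK


def bitwise_nor(a: int, b: int) -> int:
    """Compute the NOT of the OR of two 32-bit words."""
    return ~(a | b) & WORD_MASK


if __name__ == "__main__":
    print(hex(bitwise_nor(0x0F0F0F0F, 0x00000000)))
```

One consequence visible in the sketch is that NOR with a zero operand behaves as NOT, which is why an instruction set that provides NOR does not strictly need a separate NOT instruction.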
overflow (floating-point) A situation in which a positive exponent becomes too large to fit in the exponent field.

package Basically a directory that contains a group of related classes.

page fault An event that occurs when an accessed page is not present in main memory.

page table The table containing the virtual to physical address translations in a virtual memory system. The table, which is stored in memory, is typically indexed by the virtual page number; each entry in the table contains the physical page number for that virtual page if the page is currently in memory.

parallel processing program A single program that runs on multiple processors simultaneously.

PC-relative addressing An addressing regime in which the address is the sum of the program counter (PC) and a constant in the instruction.

physical address An address in main memory.

physically addressed cache A cache that is addressed by a physical address.

pipeline stall Also called bubble. A stall initiated in order to resolve a hazard.

pipelining An implementation technique in which multiple instructions are overlapped in execution, much like an assembly line.

pixel The smallest individual picture element. Screens are composed of hundreds of thousands to millions of pixels, organized in a matrix.

poison A result generated when a speculative load yields an exception, or an instruction uses a poisoned operand.

polling The process of periodically checking the status of an I/O device to determine the need to service the device.

precise interrupt Also called precise exception. An interrupt or exception that is always associated with the correct instruction in pipelined computers.

predication A technique to make instructions dependent on predicates rather than on branches.

prefetching A technique in which data blocks needed in the future are brought into the cache early by the use of special instructions that specify the address of the block.

primary memory Also called main memory. Volatile memory used to hold programs while they are running; typically consists of DRAM in today's computers.

procedure A stored subroutine that performs a specific task based on the parameters with which it is provided.

procedure call frame A block of memory that is used to hold values passed to a procedure as arguments, to save registers that a procedure may modify but that the procedure's caller does not want changed, and to provide space for variables local to a procedure.

procedure frame Also called activation record. The segment of the stack containing a procedure's saved registers and local variables.

processor-memory bus A bus that connects processor and memory and that is short, generally high speed, and matched to the memory system so as to maximize memory-processor bandwidth.

program counter (PC) The register containing the address of the instruction in the program being executed.

programmable array logic (PAL) Contains a programmable and-plane followed by a fixed or-plane.

programmable logic array (PLA) A structured-logic element composed of a set of inputs and corresponding input complements and two stages of logic: the first generating product terms of the inputs and input complements and the second generating sum
terms of the product terms. Hence, PLAs implement logic functions as a sum of products.

programmable logic device (PLD) An integrated circuit containing combinational logic whose function is configured by the end user.

programmable ROM (PROM) A form of read-only memory that can be programmed when a designer knows its contents.

propagation time The time required for an input to a flip-flop to propagate to the outputs of the flip-flop.

protected A Java keyword that restricts invocation of a method to other methods in that package.

protection A set of mechanisms for ensuring that multiple processes sharing the processor, memory, or I/O devices cannot interfere, intentionally or unintentionally, with one another by reading or writing each other's data. These mechanisms also isolate the operating system from a user process.

protection group The group of data disks or blocks that share a common check disk or block.

pseudoinstruction A common variation of assembly language instructions often treated as if it were an instruction in its own right.

public A Java keyword that allows a method to be invoked by any other method.

quotient The primary result of a division; a number that when multiplied by the divisor and added to the remainder produces the dividend.

read-only memory (ROM) A memory whose contents are designated at creation time, after which the contents can only be read. ROM is used as structured logic to implement a set of logic functions by using the terms in the logic functions as address inputs and the outputs as bits in each word of the memory.

receive message routine A routine used by a processor in machines with private memories to accept a message from another processor.

recursive procedures Procedures that call themselves either directly or indirectly through a chain of calls.

redundant arrays of inexpensive disks (RAID) An organization of disks that uses an array of small and inexpensive disks so as to increase both performance and reliability.

reference bit Also called use bit. A field that is set whenever a page is accessed and that is used to implement LRU or other replacement schemes.

reg In Verilog, a register.

register file A state element that consists of a set of registers that can be read and written by supplying a register number to be accessed.

register renaming The renaming of registers, by the compiler or hardware, to remove antidependences.

register-use convention Also called procedure call convention. A software protocol governing the use of registers by procedures.

relocation information The segment of a UNIX object file that identifies instructions and data words that depend on absolute addresses.

remainder The secondary result of a division; a number that when added to the product of the quotient and the divisor produces the dividend.

reorder buffer The buffer that holds results in a dynamically scheduled processor until it is safe to store the results to memory or a register.

reservation station A buffer within a functional unit that holds the operands and the operation.

response time Also called execution time. The total time required for the computer to complete a task, including disk accesses,
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memory accesses> I/O activities> operating 


send message routine A routine used by a 




system overhead, CPU execution time> and 


processor in machines with private memo- 




so on. 


ries to pass to another processor. 




restartable instruction An instruction that 


sensitivity list The list of signals that spec- 




can resume execution after an exception is 


ifies when an always block should be 




resolved without the exception's affecting 


reevaluated. 




the result of the instruction. 


separate compilation Splitting a program 




return address A link to the calling site that 


across many files, each of which can be 




allows a procedure to return to the proper 


compiled without knowledge of what is in 




address; in MIPS it is stored in register $ r a. 


the other files. 




rotation latency Also called delay. The 


sequential logic A group of logic elements 




time required for the desired sector of a disk 


that contain memory and hence whose val- 




to rotate under the read/write head; usually 


ue depends on the inputs as well as the cur- 




assumed to be half the rotation time. 


rent contents of the memory. 




round Method to make the intermediate 


Server A computer used for running larger 




floating-point result fit the floating-point 


programs for multiple users often simulta- 




format; the goal is typically to find the near- 


neously and typically accessed only via a 




est number that can be represented in the 


network. 




format. 


set-associative cache A cache that has a 




scientific notation A notation that renders 


fixed number of locations (at least two) 




numbers with a single digit to the left of the 


where each block can be placed. 




decimal point. 


set-up time The minimum time that the 




secondary memory Nonvolatile memory 


input to a memory device must be valid be- 




used to store programs and data between 


fore the clock edge. 




runs; typically consists of magnetic disks in 


shared memory A memory for a parallel 




today's computers. 


processor with a single address space, im- 




sector One of the segments that make up a 


plying implicit communication with loads 




track on a magnetic disk; a sector is the 


and stores. 




smallest amount of information that is read 


sign-extend To increase the size of a data 




or written on a disk. 


item by replicating the high-order sign bit 




seek The process of positioning a read/write 


of the original data item in the high-order 




head over the proper track on a disk. 


bits of the larger, destination data item. 




segmentation A variable-size address mapping scheme in which an address consists of two parts: a segment number, which is mapped to a physical address, and a segment offset.

selector value Also called control value. The control signal that is used to select one of the input values of a multiplexor as the output of the multiplexor.

silicon A natural element which is a semiconductor.

silicon crystal ingot A rod composed of a silicon crystal that is between 6 and 12 inches in diameter and about 12 to 24 inches long.

simple programmable logic device (SPLD) Programmable logic device usually containing either a single PAL or PLA.




semiconductor A substance that does not conduct electricity well.

single precision A floating-point value represented in a single 32-bit word.
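To make the single precision entry above concrete, the sketch below splits a 32-bit word into fields. It assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits), which the glossary entry itself does not spell out; the function name is hypothetical.

```python
import struct

def decode_single(word: int):
    """Return (sign, biased_exponent, fraction, value) for a 32-bit pattern.

    Assumes the IEEE 754 single-precision field layout: 1 / 8 / 23 bits.
    """
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("word must fit in 32 bits")
    sign = word >> 31
    biased_exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    # Reinterpret the same 32 bits as a float to obtain the numeric value.
    value = struct.unpack(">f", struct.pack(">I", word))[0]
    return sign, biased_exponent, fraction, value
```

Under that assumption, the word 0x3F800000 decodes to sign 0, biased exponent 127, fraction 0, which is the value 1.0.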
single-cycle implementation Also called single clock cycle implementation. An implementation in which an instruction is executed in one clock cycle.

small computer systems interface (SCSI) A bus used as a standard for I/O devices.

state element A memory element.

static data The portion of memory that contains data whose size is known to the compiler and whose lifetime is the program's entire execution.


snooping cache coherency A method for maintaining cache coherency in which all cache controllers monitor or snoop on the bus to determine whether or not they have a copy of the desired block.

source language The high-level language in which a program is originally written.

static method A method that applies to the whole class rather than to an individual object. It is unrelated to static in C.

static multiple issue An approach to implementing a multiple-issue processor where many decisions are made by the compiler before execution.


spatial locality The locality principle stating that if a data location is referenced, data locations with nearby addresses will tend to be referenced soon.

static random access memory (SRAM) A memory where data is stored statically (as in flip-flops) rather than dynamically (as in DRAM). SRAMs are faster than DRAMs, but less dense and more expensive per bit.




speculation An approach whereby the compiler or processor guesses the outcome of an instruction to remove it as a dependence in executing other instructions.

sticky bit A bit used in rounding in addition to guard and round that is set whenever there are nonzero bits to the right of the round bit.
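The sticky bit entry above can be illustrated with a small sketch. It assumes an integer significand whose low-order bits lie beyond the precision being kept, with the guard bit first, the round bit second, and the sticky bit summarizing everything to the right of the round bit; the function name and argument layout are hypothetical.

```python
def guard_round_sticky(significand: int, extra_bits: int):
    """Return (kept, guard, round_bit, sticky) for an integer significand.

    The low `extra_bits` bits (at least 2) are the ones beyond the kept precision.
    """
    if extra_bits < 2 or significand < 0:
        raise ValueError("need extra_bits >= 2 and a non-negative significand")
    kept = significand >> extra_bits
    guard = (significand >> (extra_bits - 1)) & 1
    round_bit = (significand >> (extra_bits - 2)) & 1
    # Sticky is the OR of every bit to the right of the round bit.
    below_round = significand & ((1 << (extra_bits - 2)) - 1)
    sticky = 1 if below_round else 0
    return kept, guard, round_bit, sticky
```

With the pattern 1011 10001 and five extra bits, the kept part is 1011, guard is 1, round is 0, and sticky is 1 because a nonzero bit remains to the right of the round bit.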




split cache A scheme in which a level of the memory hierarchy is composed of two independent caches that operate in parallel with each other with one handling instructions and one handling data.

split transaction protocol A protocol in which the bus is released during a bus transaction while the requester is waiting for the data to be transmitted, which frees the bus for access by another requester.

stop In IA-64, an explicit indicator of a break between independent and dependent instructions.

stored-program concept The idea that instructions and data of many types can be stored in memory as numbers, leading to the stored program computer.

striping Allocation of logically sequential blocks to separate disks to allow higher performance than a single disk can deliver.




stack pointer A value denoting the most recently allocated address in a stack that shows where registers should be spilled or where old register values can be found.

stack segment The portion of memory used by a program to hold procedure call frames.

stack A data structure for spilling registers organized as a last-in-first-out queue.

structural hazard An occurrence in which a planned instruction cannot execute in the proper clock cycle because the hardware cannot support the combination of instructions that are set to execute in the given clock cycle.

structural specification Describes how a digital system is organized in terms of a hierarchical connection of elements.


standby spares Reserve hardware resources that can immediately take the place of a failed component.

sum of products A form of logical representation that employs a logical sum (OR) of products (terms joined using the AND operator).
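The sum of products entry above can be sketched by deriving one product term per truth-table row whose output is 1 and ORing the terms together. This is an illustrative sketch only; the helper names, the primed-letter notation for complements, and the three-input majority example are assumptions chosen for the demonstration.

```python
from itertools import product

def minterms(truth_table):
    """Input rows (tuples of 0/1) for which the function outputs 1."""
    return [inputs for inputs, out in truth_table.items() if out]

def evaluate(terms, inputs):
    """OR together the product terms; each term ANDs every input or its complement."""
    return int(any(all(bit == want for bit, want in zip(inputs, term))
                   for term in terms))

def to_expression(terms, names):
    """Render the terms as text, writing a complemented input X as X'."""
    rendered = ["".join(n if bit else n + "'" for n, bit in zip(names, term))
                for term in terms]
    return " + ".join(rendered) if rendered else "0"

# Hypothetical example: majority of three inputs A, B, C.
majority = {row: int(sum(row) >= 2) for row in product((0, 1), repeat=3)}
majority_terms = minterms(majority)
```

For the majority example, `to_expression(majority_terms, "ABC")` yields `A'BC + AB'C + ABC' + ABC`, and `evaluate` reproduces the truth table for every input row.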
supercomputer A class of computers with the highest performance and cost; they are configured as servers and typically cost millions of dollars.

system performance evaluation cooperative (SPEC) benchmark A set of standard CPU-intensive, integer and floating point benchmarks based on real programs.




superscalar An advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle.

swap space The space on the disk reserved for the full virtual memory space of a process.

systems software Software that provides services that are commonly useful, including operating systems, compilers, and assemblers.


switched network A network of dedicated point-to-point links that are connected to each other with a switch.

symbol table A table that matches names of labels to the addresses of the memory words that instructions occupy.

symmetric multiprocessor (SMP) or uniform memory access (UMA) A multiprocessor in which accesses to main memory take the same amount of time no matter which processor requests the access and no matter which word is asked.

tag A field in a table used for a memory hierarchy that contains the address information required to identify whether the associated block in the hierarchy corresponds to a requested word.

temporal locality The principle stating that if a data location is referenced then it will tend to be referenced again soon.

terabyte Originally 1,099,511,627,776 (2^40) bytes, although some communications and secondary storage systems have redefined it to mean 1,000,000,000,000 (10^12) bytes.
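The two meanings in the terabyte entry above differ by roughly nine percent, which a short calculation makes visible. This is a minimal sketch; the constant and function names are hypothetical.

```python
BINARY_TERABYTE = 2 ** 40     # the original meaning given in the entry
DECIMAL_TERABYTE = 10 ** 12   # the redefined meaning used by some storage systems

def decimal_to_binary_terabytes(decimal_tb: float) -> float:
    """Express a capacity quoted in 10^12-byte terabytes in 2^40-byte terabytes."""
    return decimal_tb * DECIMAL_TERABYTE / BINARY_TERABYTE
```

A capacity advertised as one decimal terabyte therefore corresponds to about 0.909 of a binary terabyte, a shortfall of 99,511,627,776 bytes.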


synchronization The process of coordinating the behavior of two or more processes, which may be running on different processors.

synchronizer failure A situation in which a flip-flop enters a metastable state and where some logic blocks reading the output of the flip-flop see a 0 while others see a 1.

synchronous bus A bus that includes a clock in the control lines and a fixed protocol for communicating that is relative to the clock.

text segment The segment of a UNIX object file that contains the machine language code for routines in the source file.

three Cs model A cache model in which all cache misses are classified into one of three categories: compulsory misses, capacity misses, and conflict misses.

tournament branch predictor A branch predictor with multiple predictions for each branch and a selection mechanism that chooses which predictor to enable for a given branch.


synchronous system A memory system that employs clocks and where data signals are read only when the clock indicates that the signal values are stable.

trace cache An instruction cache that holds a sequence of instructions with a given starting address; in recent Pentium implementations the trace cache holds microoperations rather than IA-32 instructions.




system call A special instruction that transfers control from user mode to a dedicated location in supervisor code space, invoking the exception mechanism in the process.

track One of thousands of concentric circles that make up the surface of a magnetic disk.


system CPU time The CPU time spent in the operating system performing tasks on behalf of the program.

transaction processing A type of application that involves handling small short operations (called transactions) that typically require both I/O and computation. Transaction processing applications typically have both response time requirements and a performance measurement based on the throughput of transactions.

transistor An on/off switch controlled by an electric signal.

translation-lookaside buffer (TLB) A cache that keeps track of recently used address mappings to avoid an access to the page table.

underflow (floating-point) A situation in which a negative exponent becomes too large to fit in the exponent field.

units in the last place (ulp) The number of bits in error in the least significant bits of the significand between the actual number and the number that can be represented.

unmapped A portion of the address space that cannot have page faults.

unresolved reference A reference that requires more information from an outside source in order to be complete.

untaken branch One that falls through to the successive instruction. A taken branch is one that causes transfer to the branch target.

user CPU time The CPU time spent in a program itself.

vacuum tube An electronic component, predecessor of the transistor, that consists of a hollow glass tube about 5 to 10 cm long from which as much air has been removed as possible and which uses an electron beam to transfer data.

valid bit A field in the tables of a memory hierarchy that indicates that the associated block in the hierarchy contains valid data.

vector processor An architecture and compiler model that was popularized by supercomputers in which high-level operations work on linear arrays of numbers.

vectored interrupt An interrupt for which the address to which control is transferred is determined by the cause of the exception.

Verilog One of the two most common hardware description languages.

very large scale integrated (VLSI) circuit A device containing hundreds of thousands to millions of transistors.

VHDL One of the two most common hardware description languages.

virtual address An address that corresponds to a location in virtual space and is translated by address mapping to a physical address when memory is accessed.

virtual machine A virtual computer that appears to have nondelayed branches and loads and a richer instruction set than the actual hardware.

virtual memory A technique that uses main memory as a "cache" for secondary storage.

virtually addressed cache A cache that is accessed with a virtual address rather than a physical address.

volatile memory Storage, such as DRAM, that retains data only if it is receiving power.

wafer A slice from a silicon ingot no more than 0.1 inch thick, used to create chips.

weighted arithmetic mean An average of the execution time of a workload with weighting factors designed to reflect the presence of the programs in a workload; computed as the sum of the products of weighting factors and execution times.

wide area network A network extended over hundreds of kilometers that can span a continent.

wire In Verilog, specifies a combinational signal.

word The natural unit of access in a computer, usually a group of 32 bits; corresponds to the size of a register in the MIPS architecture.
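The weighted arithmetic mean entry above describes a sum of products of weighting factors and execution times, which can be sketched as follows. The function name is hypothetical, and the requirement that the weights sum to 1 is an assumption added here so the result reads as an average.

```python
def weighted_arithmetic_mean(times, weights):
    """Sum of weight * execution time over the programs in a workload.

    Assumes (for this sketch) that the weighting factors sum to 1.
    """
    if not times or len(times) != len(weights):
        raise ValueError("need one weight per execution time")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(w * t for w, t in zip(weights, times))
```

For instance, two programs taking 1 and 1000 seconds with equal weights of 0.5 give a weighted mean of 500.5 seconds; shifting weight toward the shorter program lowers the figure accordingly.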
workload A set of programs run on a computer that is either the actual collection of applications run by a user or is constructed from real programs to approximate such a mix. A typical workload specifies both the programs as well as the relative frequencies.

write buffer A queue that holds data while the data are waiting to be written to memory.

write-back A scheme that handles writes by updating values only to the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced.



write-invalidate A type of snooping protocol in which the writing processor causes all copies in other caches to be invalidated before changing its local copy, which allows it to update the local data until another processor asks for it.

write-through A scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two.
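The difference between the write-back and write-through entries can be seen by counting memory writes in a toy model. The sketch below is a deliberately simplified, hypothetical cache holding a single block; the class name, its methods, and the allocate-on-write behavior are assumptions made for illustration and do not describe any real design.

```python
class OneBlockCache:
    """Toy cache with one block, used only to compare the two write policies."""

    def __init__(self, policy: str):
        if policy not in ("write-through", "write-back"):
            raise ValueError("unknown policy")
        self.policy = policy
        self.tag = None          # address of the block currently cached
        self.value = None
        self.dirty = False       # only meaningful for write-back
        self.memory = {}
        self.memory_writes = 0

    def _write_memory(self, addr, value):
        self.memory[addr] = value
        self.memory_writes += 1

    def _evict(self):
        # Write-back updates memory only when a modified block is replaced.
        if self.policy == "write-back" and self.dirty:
            self._write_memory(self.tag, self.value)
        self.dirty = False
        self.tag = None

    def write(self, addr, value):
        if self.tag != addr:
            self._evict()
            self.tag = addr
        self.value = value
        if self.policy == "write-through":
            self._write_memory(addr, value)   # memory is updated on every write
        else:
            self.dirty = True

    def flush(self):
        self._evict()
```

Writing the same address three times and then a different address costs four memory writes under write-through, but only one under write-back (the replaced block), plus one more when the remaining block is finally flushed.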

yield The percentage of good dies from the total number of dies on the wafer.
Morse, S., B. Ravenal, S. Mazor, and W. Pohlman [1980]. "Intel Microprocessors: 8080 to 8086," Computer
13:10 (October).

The architecture history of Intel's microprocessors from the 4004 to the 8086, according to the people who participated in the
designs.

Wakerly, J. [1989]. Microcomputer Architecture and Programming, Wiley, New York.

The Motorola 680x0 is the main focus of the book, but it covers the Intel 8086, Motorola 6809, TI 9900, and 
Zilog Z8000. 
Chapter 3

Burks, A. W., H. H. Goldstine, and J. von Neumann [1946]. "Preliminary discussion of the logical design of 
an electronic computing instrument," Report to the U.S. Army Ordnance Dept., p. 1; also in Papers of John
von Neumann, W. Aspray and A. Burks, eds., MIT Press, Cambridge, MA, and Tomash Publishers, Los 
Angeles, 97-146, 1987. 

This classic paper includes arguments against floating-point hardware. 

Goldberg, D. [1991]. "What every computer scientist should know about floating-point arithmetic," ACM
Computing Surveys 23(1), 5-48.

Another good introduction to floating-point arithmetic by the same author, this time with emphasis on software. 

Goldberg, D. [2002]. "Computer arithmetic," Appendix A of Computer Architecture: A Quantitative 
Approach, third edition, J. L. Hennessy and D. A. Patterson, Morgan Kaufmann Publishers, San Francisco. 
(This appendix is online.) 

A more advanced introduction to integer and floating-point arithmetic, with emphasis on hardware. It covers 
Sections 3.4-3.6 of this book in just 10 pages, leaving another 45 pages for advanced topics. 

Kahan, W. [1972]. "A survey of error-analysis," in Info. Processing 71 (Proc. IFIP Congress 71 in Ljubljana), 
vol. 2, pp. 1214-39, North-Holland Publishing, Amsterdam. 

This survey is a source of stories on the importance of accurate arithmetic. 

Kahan, W. [1983]. "Mathematics written in sand," Proc. Amer. Stat. Assoc. Joint Summer Meetings of 1983,
Statistical Computing Section, pp. 12-26. 

The title refers to silicon and is another source of stories illustrating the importance of accurate arithmetic. 

Kahan, W. [1990]. "On the advantage of the 8087's stack," unpublished course notes, Computer Science 
Division, University of California at Berkeley. 

What the 8087 floating-point architecture could have been. 

Kahan, W. [1997]. Available via a link to Kahan's homepage at www.mkp.com/books_catalog/cod/links.htm.

A collection of memos related to floating point, including "Beastly numbers" (another less famous Pentium bug), 
"Notes on the IEEE floating point arithmetic" (including comments on how some features are atrophying), and 
"The baleful effects of computing benchmarks" (on the unhealthy preoccupation on speed versus correctness, 
accuracy, ease of use, flexibility, . . .). 

Koren, I. [2002]. Computer Arithmetic Algorithms, second edition, A. K. Peters, Natick, MA.

A textbook aimed at seniors and first-year graduate students that explains fundamental principles of basic 
arithmetic, as well as complex operations such as logarithmic and trigonometric functions. 

Wilkes, M. V. [1985]. Memoirs of a Computer Pioneer, MIT Press, Cambridge, MA.

This computer pioneer's recollections include the derivation of the standard hardware for multiply and divide
developed by von Neumann. 
Chapter 4

Curnow, H. J., and B. A. Wichmann [1976]. "A synthetic benchmark," The Computer J. 19(1):80.

Describes the first major synthetic benchmark, Whetstone, and how it was created.

Flemming, P. J., and J. J. Wallace [1986]. "How not to lie with statistics: The correct way to summarize
benchmark results," Comm. ACM 29:3 (March) 218-21.

Describes some of the underlying principles in using different means to summarize performance results. 

McMahon, F. M. [1986]. "The Livermore FORTRAN kernels: A computer test of numerical performance
range," Tech. Rep. UCRL-55745, Lawrence Livermore National Laboratory, Univ. of California, Livermore
(December). 

Describes the Livermore Loops — a set of Fortran kernel benchmarks. 

Smith, J. E. [1988]. "Characterizing computer performance with a single number," Comm. ACM 31:10 
(October) 1202-06. 

Describes the difficulties of summarizing performance with just one number and argues for total execution time 
as the only consistent measure. 

SPEC [2000]. SPEC Benchmark Suite Release 1.0, SPEC, Santa Clara, CA, October 2.

Describes the SPEC benchmark suite. For up-to-date information, see the SPEC Web page via a link at
www.mkp.com/books_catalog/cod/links.htm.

Weicker, R. P. [1984]. "Dhrystone: A synthetic systems programming benchmark," Comm. ACM 27:10 
(October) 1013-30. 

Describes the Dhrystone benchmark and its construction. 

Chapter 5 

A basic Verilog tutorial is included on the CD. There are also many books both on Verilog and on digital 
design using Verilog. 

Kidder, T. [1981]. Soul of a New Machine, Little, Brown, and Co., New York.

Describes the design of the Data General Eclipse series that replaced the first DG machines such as the Nova. 
Kidder records the intimate interactions among architects, hardware designers, microcoders, and project man- 
agement. 

Levy, H. M., and R. H. Eckhouse, Jr. [1989]. Computer Programming and Architecture: The VAX, second ed.,
Digital Press, Bedford, MA. 

Good description of the VAX architecture and several different microprogrammed implementations. 

Patterson, D. A. [1983]. "Microprogramming," Scientific American 248:3 (March) 36-43.

Overview of microprogramming concepts.

Tucker, S. G. [1967]. "Microprogram control for the System/360," IBM Systems J. 6:4, 222-41.

Describes the microprogrammed control for the 360, the first microprogrammed commercial machine.

Wilkes, M. V. [1985]. Memoirs of a Computer Pioneer, MIT Press, Cambridge, MA. 

Intriguing biography with many stories about industry pioneers and the trials and successes in building early 
machines. 

Wilkes, M. V., and J. B. Stringer [1953]. "Microprogramming and the design of the control circuits in an
electronic digital computer," Proc. Cambridge Philosophical Society 49:230-38. Also reprinted in D. P. Siewiorek, C.
G. Bell, and A. Newell, Computer Structures: Principles and Examples, McGraw-Hill, New York, 158-63, 1982,
and in "The Genesis of Microprogramming," in Annals of the History of Computing 8:116.

These two classic papers describe Wilkes's proposal for microcode. 

Chapter 6 

Bhandarkar, D., and D. W. Clark [1991]. "Performance from architecture: Comparing a RISC and a CISC 
with similar hardware organizations," Proc. Fourth Conf. on Architectural Support for Programming Lan- 
guages and Operating Systems, IEEE/ACM (April), Palo Alto, CA, 310-19. 

A quantitative comparison of RISC and CISC written by scholars who argued for CISCs as well as built them; 
they conclude that MIPS is between 2 and 4 times faster than a VAX built with similar technology, with a mean
of 2.7. 

Fisher, J. A., and B. R. Rau [1993]. Journal of Supercomputing (January), Kluwer.

This entire issue is devoted to the topic of exploiting ILP. It contains papers on both the architecture and software 
and is a wonderful source for further references. 

Hennessy, J. L., and D. A. Patterson [2001]. Computer Architecture: A Quantitative Approach, third ed., San 
Francisco: Morgan Kaufmann. 

Chapters 3 and 4 go into considerably more detail about pipelined processors (over 200 pages), including super- 
scalar processors and VLIW processors. 

Jouppi, N. P., and D. W. Wall [1989]. "Available instruction-level parallelism for superscalar and superpipe- 
lined processors," Proc. Third Conf. on Architectural Support for Programming Languages and Operating Sys- 
tems, IEEE/ACM (April), Boston, 272-82. 

A comparison of deeply pipelined (also called superpipelined) and superscalar systems. 

Kogge, P. M. [1981]. The Architecture of Pipelined Computers, New York: McGraw-Hill.

A formal text on pipelined control, with emphasis on underlying principles.

Russell, R. M. [1978]. "The CRAY-1 computer system," Comm. of the ACM 21:1 (January) 63-72.

A short summary of a classic computer, which uses vectors of operations to remove pipeline stalls.

Smith, A., and J. Lee [1984]. "Branch prediction strategies and branch target buffer design," Computer 17:1 
(January) 6-22. 

An early survey on branch prediction. 
Smith, J. E., and A. R. Pleszkun [1988]. "Implementing precise interrupts in pipelined processors," IEEE
Trans. on Computers 37:5 (May) 562-73.

Covers the difficulties in interrupting pipelined computers. 

Thornton, J. E. [1970]. Design of a Computer: The Control Data 6600, Glenview, IL: Scott, Foresman.

A classic book describing a classic computer, considered the first supercomputer.

Chapter 7 

Conti, C., D. H. Gibson, and S. H. Pitowsky [1968]. "Structural aspects of the System/360 Model 85, part
I: General organization," IBM Systems J. 7:1, 2-14.

A classic paper that describes the first commercial computer to use a cache and its resulting performance.

Cantin, J. F., and M. D. Hill [2001]. "Cache performance for selected SPEC CPU2000 benchmarks,"
SIGARCH Computer Architecture News 29:4 (September), 13-18.

A reference paper of cache miss rates for many cache sizes for the SPEC2000 benchmarks. 

Hennessy, J., and D. Patterson [2003]. Chapter 5 in Computer Architecture: A Quantitative Approach, Third 
edition, Morgan Kaufmann Publishers, San Francisco. 

For more in-depth coverage of a variety of topics including protection, cache performance of out-of-order proces- 
sors, virtually addressed caches, multilevel caches, compiler optimizations, additional latency tolerance mecha- 
nisms, and cache coherency. 

Kilburn, T., D. B. G. Edwards, M. J. Lanigan, and F. H. Sumner [1962]. "One-level storage system," IRE
Transactions on Electronic Computers EC-11 (April) 223-35. Also appears in D. P. Siewiorek, C. G. Bell, and
A. Newell, Computer Structures: Principles and Examples, McGraw-Hill, New York, 135-48, 1982.

This classic paper is the first proposal for virtual memory. 

LaMarca, A., and R. E. Ladner [1996]. "The influence of caches on the performance of heaps," ACM J. of
Experimental Algorithmics, vol. 1, www.jea.acm.org/1996/LaMarcaInfluence/.

This paper shows the difference between complexity analysis of an algorithm, instruction count performance,
and memory hierarchy for four sorting algorithms. 

Przybylski, S. A. [1990]. Cache and Memory Hierarchy Design: A Performance-Directed Approach, Morgan 
Kaufmann Publishers, San Francisco. 

A thorough exploration of multilevel memory hierarchies and their performance. 

Ritchie, D.M. and K. Thompson. "UNIX Timesharing System: The UNIX Timesharing System." Bell System 
Technical Journal, August 1978, pp. 1991-2019. 

A paper describing the most elegant operating system ever invented. 

Ritchie, Dennis. "The Evolution of the UNIX Timesharing System." AT&T Bell Laboratories Technical Journal,
August 1984, pp. 1577-1593.

The history of UNIX from one of its inventors. 
Silberschatz, A., P. Galvin, and G. Gagne [2003]. Operating System Concepts, sixth edition, Addison-Wesley,
Reading, MA. 

An operating systems textbook with a thorough discussion of virtual memory, processes and process manage- 
ment, and protection issues. 

Smith, A. J. [1982]. "Cache memories," Computing Surveys 14:3 (September) 473-530.

The classic survey paper on caches. This paper defined the terminology for the field and has served as a reference 
for many computer designers. 

Smith, D.K. and R.C. Alexander. Fumbling the Future: How Xerox Invented, Then Ignored, the First Personal 
Computer. New York: Morrow, 1988. 

A popular book that explains the role of Xerox PARC in laying the foundation for today's computing, which
Xerox did not substantially benefit from.

Tanenbaum, A. [2001]. Modern Operating Systems, second edition, Prentice Hall, Upper Saddle River, NJ.

An operating system textbook with a good discussion of virtual memory.

Wilkes, M. [1965]. "Slave memories and dynamic storage allocation," IEEE Trans. Electronic Computers EC- 
14:2 (April) 270-71. 

The first, classic paper on caches. 

Chapter 8 

Bashe, C. J., L. R. Johnson, J. H. Palmer, and E. W. Pugh [1986]. IBM's Early Computers, Cambridge, MA: 
MIT Press. 

Describes the I/O system architecture and devices in IBM's early computers. 

Brenner, P. [1997]. A Technical Tutorial on the IEEE 802.11 Protocol, found on many Web sites.

A widely referenced short tutorial that outlives the startup company for which the author worked.

Chen, P. M., E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson [1994]. "RAID: High-performance,
reliable secondary storage," ACM Computing Surveys 26:2 (June), 145-88.

A tutorial covering disk arrays and the advantages of such an organization. 

Gray, J. [1990]. "A census of Tandem system availability between 1985 and 1990," IEEE Transactions on
Reliability 39:4 (October), 409-18.

One of the first papers to categorize, quantify, and publish reasons for failures. It is still widely quoted. 

Gray, J., and A. Reuter [1993]. Transaction Processing: Concepts and Techniques, San Francisco: Morgan
Kaufmann.

A description of transaction processing, including discussions of benchmarking and performance evaluation. 

Hennessy, J., and D. Patterson [2003]. Computer Architecture: A Quantitative Approach, third ed., San Fran- 
cisco: Morgan Kaufmann Publishers, Chapters 7 and 8. 

Chapter 7 focuses on storage, including an extensive discussion of RAID technologies and dependability. Chapter
8 focuses on networks. 
Kahn, R. E. [1972]. "Resource-sharing computer communication networks," Proc. IEEE 60:11 (November),
1397-1407. 

A classic paper that describes the ARPANET.

Laprie, J.-C. [1985]. "Dependable computing and fault tolerance: concepts and terminology," 15th Annual
Int'l Symposium on Fault-Tolerant Computing FTCS 15, Digest of Papers, Ann Arbor, MI (June 19-21), 2-11.

The paper that introduced standard definitions of dependability, reliability, and availability. 

Levy, J. V. [1978]. "Buses: The skeleton of computer structures," in Computer Engineering: A DEC View of
Hardware Systems Design, C. G. Bell, J. C. Mudge, and J. E. McNamara, eds., Bedford, MA: Digital Press. 

This is a good overview of key concepts in bus design with some examples from DEC machines. 

Lyman, P., and H. R. Varian [2003]. "How much information? 2003,"
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/.

This project estimates the amount of information in the world from all possible sources. 

Metcalfe, R. M., and D. R. Boggs [1976]. "Ethernet: Distributed packet switching for local computer net- 
works," Comm. ACM 19:7 (July), 395-404. 

A classic paper that describes the Ethernet network. 

Myer, T. H., and I. E. Sutherland [1968]. "On the design of display processors," Communications of the ACM
11:6 (June), 410-14.

Another classic that notes how building powerful coprocessors can be a never-ending cycle.

Okada, S., Y. Matsuda, T. Yamada, and A. Kobayashi [1999]. "System on a chip for digital still camera," IEEE
Trans. on Consumer Electronics 45:3 (August), 584-90.

Oppenheimer, D., A. Ganapathi, and D. Patterson [2003]. "Why do Internet services fail, and what can be
done about it?," 4th Usenix Symposium on Internet Technologies and Systems, March 26-28, Seattle, WA.

A recent update on Gray's classic paper, this time focused on Internet sites. 

Patterson, D., G. Gibson, and R. Katz [1988]. "A case for redundant arrays of inexpensive disks (RAID)," 
SIGMOD Conference. 109-16. 

A classic paper that advocates arrays of smaller disks and introduces RAID levels. 

Saltzer, J. H., D. P. Reed, and D. D. Clark [1984]. "End-to-end arguments in system design," ACM Trans. on
Computer Systems 2:4 (November), 277-88.

A classic paper that defines the end-to-end argument.

Smotherman, M. [1989]. "A sequencing-based taxonomy of I/O systems and review of historical machines,"
Computer Architecture News 17:5 (September), 5-15. 

Describes the development of important ideas in I/O. 

Talagala, N., R. Arpaci-Dusseau, and D. Patterson [2000]. "Micro-benchmark based extraction of local and
global disk characteristics," U.C. Berkeley Technical Report CSD-99-1063, June 13.

Describes a simple program to automatically deduce key parameters of disks. 
Chapter 9

Almasi, G. S., and A. Gottlieb [1989]. Highly Parallel Computing, Benjamin/Cummings, Redwood City, CA.

A textbook covering parallel computers.

Amdahl, G. M. [1967]. "Validity of the single processor approach to achieving large scale computing
capabilities," Proc. AFIPS Spring Joint Computer Conf., Atlantic City, NJ, (April) 483-85.

Written in response to the claims of the Illiac IV, this three-page article describes Amdahl's law and gives the
classic reply to arguments for abandoning the current form of computing. 

Andrews, G. R. [1991]. Concurrent Programming: Principles and Practice, Benjamin/Cummings, Redwood 
City, CA. 

A text that gives the principles of parallel programming. 

Archibald, J., and J.-L. Baer [1986]. "Cache coherence protocols: Evaluation using a multiprocessor
simulation model," ACM Trans. on Computer Systems 4:4 (November), 273-98.

Classic survey paper of shared-bus cache coherence protocols. 

Arpaci-Dusseau, A., R. Arpaci-Dusseau, D. Culler, J. Hellerstein, and D. Patterson [1997]. "High- 
performance sorting on networks of workstations," Proc. ACM SIGMOD/PODS Conference on Management 
of Data, Tucson, AZ, May 12-15. 

How a world record sort was performed on a cluster, including architecture critique of the workstation and
network interface. By April 1, 1997, they pushed the record to 8.6 GB in 1 minute and 2.2 seconds to sort 100 MB.

Bell, C. G. [1985]. "Multis: A new class of multiprocessor computers," Science 228 (April 26), 462-67.

Distinguishes shared address and nonshared address multiprocessors based on microprocessors.

Culler, D. E., and J. P. Singh, with A. Gupta [1998]. Parallel Computer Architecture, Morgan Kaufmann, San
Francisco.

A textbook on parallel computers.

Falk, H. [1976]. "Reaching for the Gigaflop," IEEE Spectrum 13:10 (October), 65-70.

Chronicles the sad story of the Illiac IV: four times the cost and less than one-tenth the performance of original
goals. 

Flynn, M. J. [1966]. "Very high-speed computing systems," Proc. IEEE 54:12 (December), 1901-09.

Classic article showing SISD/SIMD/MISD/MIMD classifications.

Hennessy, J., and D. Patterson [2003]. Chapters 6 and 8 in Computer Architecture: A Quantitative Approach,
third edition, Morgan Kaufmann Publishers, San Francisco. 

A more in-depth coverage of a variety of multiprocessor and cluster topics, including programs and measure- 
ments. 

Hord, R. M. [1982]. The Illiac IV, the First Supercomputer, Computer Science Press, Rockville, MD.

A historical accounting of the Illiac IV project.

Hwang, K. [1993]. Advanced Computer Architecture with Parallel Programming, McGraw-Hill, New York.

Another textbook covering parallel computers. 

Kozyrakis, C., and D. Patterson [2003]. "Scalable vector processors for embedded systems," IEEE Micro 23:6
(November-December), 36-45. 

Examination of a vector architecture for the MIPS instruction set in media and signal processing. 

Menabrea, L. F. [1842]. "Sketch of the analytical engine invented by Charles Babbage," Bibliothèque
Universelle de Genève (October).

Certainly the earliest reference on multiprocessors, this mathematician made this comment while translating
papers on Babbage's mechanical computer. 

Pfister, G. F. [1998]. In Search of Clusters: The Coming Battle in Lowly Parallel Computing, second edition, 
Prentice-Hall, Upper Saddle River, NJ.

An entertaining book that advocates clusters and is critical of NUMA multiprocessors.

Seitz, C. [1985]. "The Cosmic Cube," Comm. ACM 28:1 (January), 22-31.

A tutorial article on a parallel processor connected via a hypercube. The Cosmic Cube is the ancestor of the Intel
supercomputers.

Slotnick, D. L. [1982]. "The conception and development of parallel processors — A personal memoir" 
Annals of the History of Computing 4:1 (January), 20-30. 

Recollections of the beginnings of parallel processing by the architect of the Illiac IV.

Appendix A 

Sweetman, D. [1999]. See MIPS Run, Morgan Kaufmann Publishers, San Francisco, CA.

A complete, detailed, and engaging introduction to the MIPS instruction set and assembly language program- 
ming on these machines. 

Detailed documentation on the MIPS32 architecture is available on the Web: 

MIPS32™ Architecture for Programmers Volume I: Introduction to the MIPS32™ Architecture
(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/ArchitectureProgrammingPublicationsforMIPS32/MD00082-2B-MIPS32INT-AFP-02.00.pdf/getDownload)

MIPS32™ Architecture for Programmers Volume II: The MIPS32™ Instruction Set
(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/ArchitectureProgrammingPublicationsforMIPS32/MD00086-2B-MIPS32BIS-AFP-02.00.pdf/getDownload)

MIPS32™ Architecture for Programmers Volume III: The MIPS32™ Privileged Resource Architecture
(http://mips.com/content/Documentation/MIPSDocumentation/ProcessorArchitecture/ArchitectureProgrammingPublicationsforMIPS32/MD00090-2B-MIPS32PRA-AFP-02M

Aho, A., R. Sethi, and J. Ullman [1985]. Compilers: Principles, Techniques, and Tools, Addison-Wesley,
Reading, MA.

Slightly dated and lacking in coverage of modern architectures, but still the standard reference on compilers.

Appendix B

Ciletti, M. D. [2002]. Advanced Digital Design with the Verilog HDL, Englewood Cliffs, NJ: Prentice-Hall.
A thorough book on logic design using Verilog. 

Katz, R. H. [2004]. Modern Logic Design, second edition, Reading, MA: Addison-Wesley.
A general text on logic design. 

Wakerly, J. F. [2000]. Digital Design: Principles and Practices, third ed., Englewood Cliffs, NJ: Prentice-Hall.
A general text on logic design. 



